# Exploring HathiTrust Metadata with NLP and Machine Learning Using spaCy

This Jupyter notebook works with metadata downloaded from HathiTrust, retrieved by a search with the following criteria:

    - Title: liber libri libris libro OR All Fields: opus opera operibus OR Title: carmen carmina carminibus
    - Language: (Latin)
    - Original Format: (Book)

## Working assumptions

- Some records will not be candidates for inclusion in the DLL Catalog because they are editions of Greek works. Such editions traditionally have Latin titles and introductory materials.
- Some records will not have corresponding authority and/or work files in the DLL Catalog.
- The names of many authors will be present in many variant spellings and forms.
- The titles of works will be particularly difficult to parse, since they will be along the lines of _opera omnia_, vel sim.

## Approach

Having relied on pure Python and pandas methods in a previous attempt, in this notebook I'll see what can be accomplished with spaCy, a Python library for natural language processing.

In [11]:
# Load the necessary libraries
import pandas as pd
import spacy
from spacy.training.example import Example
from spacy.training.iob_utils import offsets_to_biluo_tags

In [17]:
# Read in the CSV file with variant names from the DLL Catalog's authority records
df = pd.read_csv('input/variant-names.csv')

In [18]:
# Inspect the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3229 entries, 0 to 3228
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   DLL Identifier                  3227 non-null   object 
 1   Authorized Name                 3227 non-null   object 
 2   Author Name English             1252 non-null   object 
 3   Author Name Latin               2197 non-null   object 
 4   Author Name Native Language     1045 non-null   object 
 5   BNE URL                         562 non-null    object 
 6   BNF URL                         1361 non-null   object 
 7   DNB URL                         2068 non-null   object 
 8   ICCU URL                        407 non-null    object 
 9   ISNI URL                        1983 non-null   object 
 10  Other Alternative Name Form(s)  38 non-null     object 
 11  Perseus Name                    827 non-null    object 
 12  Wikidata URL                    0 

In [19]:
# Function to extract training data from the DataFrame
def create_training_data(df):
    training_data = []
    for _, row in df.iterrows():
        standard_name = row['Authorized Name']
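        # Select the variant-name columns by label so that URL columns are never included
        name_columns = ['Author Name English', 'Author Name Latin', 'Author Name Native Language']
        alternative_names = row[name_columns].dropna().tolist()  # Drop NaN values and convert to list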
        all_names = [standard_name] + alternative_names
        for name in all_names:
            if not isinstance(name, str):  # Ensure the value is a string
                continue
            start_idx = 0
            entities = []
            for part in name.split():
                end_idx = start_idx + len(part)
                entities.append((start_idx, end_idx, "PERSON"))
                start_idx = end_idx + 1  # +1 for the space
            training_data.append((name, {"entities": entities}))
    return training_data

training_data = create_training_data(df)
print(training_data[:5])  # Print the first 5 training examples

[('Albert, of Aachen, active 11th-12th century', {'entities': [(0, 7, 'PERSON'), (8, 10, 'PERSON'), (11, 18, 'PERSON'), (19, 25, 'PERSON'), (26, 35, 'PERSON'), (36, 43, 'PERSON')]}), ('Albert of Aix-la-Chapelle/Albert of Aachen', {'entities': [(0, 6, 'PERSON'), (7, 9, 'PERSON'), (10, 32, 'PERSON'), (33, 35, 'PERSON'), (36, 42, 'PERSON')]}), ('Albertus Aquensis', {'entities': [(0, 8, 'PERSON'), (9, 17, 'PERSON')]}), ("Albert d'Aix-la-Chapelle", {'entities': [(0, 6, 'PERSON'), (7, 24, 'PERSON')]}), ('Marullo Tarcaniota, Michele', {'entities': [(0, 7, 'PERSON'), (8, 19, 'PERSON'), (20, 27, 'PERSON')]})]
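
The output above shows that every whitespace-separated piece of a name, including qualifiers such as `active` and `century`, becomes its own `PERSON` entity. One alternative, sketched here as an assumption and not as the approach this notebook takes, is to annotate each name as a single span whose offsets are taken from the stripped string. The function name `whole_name_example` is hypothetical.

```python
def whole_name_example(name, label="PERSON"):
    """Build one spaCy-style training tuple with a single entity span.

    Hypothetical helper: the span covers the whole name after surrounding
    whitespace is stripped, so the offsets always match the returned text.
    """
    text = name.strip()
    return (text, {"entities": [(0, len(text), label)]})


example = whole_name_example("Marullo Tarcaniota, Michele")
print(example)
```

Whether a single span per name trains better than per-token spans on this data is untested here.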


In [20]:
# Initialize a blank NER model
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Add the PERSON label to the NER model
ner.add_label("PERSON")

# Keep only the entity spans that align with spaCy's token boundaries
def verify_and_correct_alignment(nlp, training_data):
    corrected_training_data = []
    for text, annotations in training_data:
        doc = nlp.make_doc(text)
        # char_span() returns None when the offsets do not fall on token boundaries
        corrected_entities = [
            (start, end, label)
            for start, end, label in annotations['entities']
            if doc.char_span(start, end, label=label) is not None
        ]
        corrected_training_data.append((text, {"entities": corrected_entities}))
    return corrected_training_data

training_data = verify_and_correct_alignment(nlp, training_data)

# Convert the training data to spaCy's Example format
examples = [Example.from_dict(nlp.make_doc(text), annotations) for text, annotations in training_data]



In [21]:
# Initialize the pipeline; initialize() returns the optimizer (begin_training() is deprecated in spaCy v3)
optimizer = nlp.initialize()

# Train the NER model
for i in range(20):  # Number of training iterations
    losses = {}
    nlp.update(examples, drop=0.5, sgd=optimizer, losses=losses)
    print(f"Iteration {i}: Losses {losses}")

# Save the model
nlp.to_disk("custom_ner_model")

# Load the trained model
nlp = spacy.load("custom_ner_model")

# Test the model with a sample text
test_text = "Virgil"
doc = nlp(test_text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Iteration 0: Losses {'ner': 27583.331735253334}
Iteration 1: Losses {'ner': 26159.876992583275}
Iteration 2: Losses {'ner': 25988.22890651226}
Iteration 3: Losses {'ner': 25131.86690801382}
Iteration 4: Losses {'ner': 24238.963100135326}
Iteration 5: Losses {'ner': 23550.002394080162}
Iteration 6: Losses {'ner': 22483.592281639576}
Iteration 7: Losses {'ner': 21491.744533151388}
Iteration 8: Losses {'ner': 21078.74868634343}
Iteration 9: Losses {'ner': 20057.840320557356}
Iteration 10: Losses {'ner': 19570.887015789747}
Iteration 11: Losses {'ner': 19197.53355795145}
Iteration 12: Losses {'ner': 17831.028558790684}
Iteration 13: Losses {'ner': 17661.25188705325}
Iteration 14: Losses {'ner': 16407.887051343918}
Iteration 15: Losses {'ner': 15755.708585716784}
Iteration 16: Losses {'ner': 15273.423639848828}
Iteration 17: Losses {'ner': 14983.133277595043}
Iteration 18: Losses {'ner': 15690.668749511242}
Iteration 19: Losses {'ner': 15240.339127972722}
Virgil PERSON
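
The training loop in this cell passes all examples to `nlp.update` as one batch in every iteration. A common alternative is to shuffle the examples and update in small batches. The sketch below shows only the batching pattern in plain Python, with strings standing in for `Example` objects; `iter_minibatches` and `demo_examples` are hypothetical names, and spaCy ships its own `spacy.util.minibatch` helper for the same purpose.

```python
import random


def iter_minibatches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-ins for the spaCy Example objects built earlier in the notebook
demo_examples = [f"example-{n}" for n in range(10)]

random.seed(0)                 # fixed seed only so the sketch is repeatable
random.shuffle(demo_examples)  # in training, reshuffle before every iteration

batches = list(iter_minibatches(demo_examples, size=4))
print([len(batch) for batch in batches])  # batch sizes: [4, 4, 2]
```

In the real loop, each batch would be passed to `nlp.update(batch, drop=0.5, sgd=optimizer, losses=losses)` in turn.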


In [24]:
# Test the model with a sample text
test_text = "Vergil."
doc = nlp(test_text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Vergil. PERSON
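
Here the trailing period is part of the entity text, so an exact dictionary lookup on `Vergil.` cannot find a key that lacks the period. One possible remedy, offered as a sketch and not as part of this notebook's pipeline, is to normalize strings before the lookup. The helper `normalize_name` and its regular expression are assumptions; the pattern covers only qualifiers like those visible in this notebook's data, such as `active 4th century` or `4th cent.`

```python
import re

# Matches a trailing qualifier such as ", active 4th century" or ", 4th cent"
QUALIFIER = re.compile(
    r",\s*(?:active\s+)?\d+(?:st|nd|rd|th)(?:-\d+(?:st|nd|rd|th))?\s+cent(?:ury|\.)?\s*$",
    re.IGNORECASE,
)


def normalize_name(name):
    """Strip surrounding whitespace, trailing periods, and century qualifiers."""
    cleaned = name.strip().rstrip(".").strip()
    cleaned = QUALIFIER.sub("", cleaned)
    return cleaned.strip()


for raw in ["Vergil.", "Atilius Fortunatianus, active 4th century.", "Atilius Fortunatianus, 4th cent."]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

For the lookup to benefit, the same normalization would have to be applied to the dictionary keys as well as to the HathiTrust strings. Stripping trailing periods also alters names that legitimately end in an abbreviation or initial, so the results would need spot-checking.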


In [28]:
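# Build a lookup from each authorized name to its DLL identifier, skipping rows that lack either value
df2 = df[['Authorized Name','DLL Identifier']].dropna()
lookup_dict = df2.set_index('Authorized Name')['DLL Identifier'].to_dict()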

In [32]:
# Load the HathiTrust metadata
df3 = pd.read_csv('input/1908698974-1722799169.txt', sep='\t')
# Make a new dataframe with the required columns; copy() avoids a SettingWithCopyWarning when a column is added later
hathidata = df3[['author','title','imprint','pub_place','rights_date_used','handle_url']].copy()

In [38]:
# Use the lookup_dict created from the DataFrame with "DLL Identifier" and "Authorized Name"

# Function to identify named entities and match to DLL identifiers
def identify_and_match_author(nlp, text, lookup_dict):
    if pd.isna(text):
        return None
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            matched_name = ent.text
            return lookup_dict.get(matched_name)
    return None

# Apply the function to the 'author' column and create a new column for DLL identifiers
hathidata.loc[:,'dll_identifier'] = hathidata['author'].apply(lambda x: identify_and_match_author(nlp, x, lookup_dict))

# Make the new dataframe
unreconciled = hathidata[hathidata['dll_identifier'].isna()]
# Use nunique() to count the unique values in the "author" column
print(f"Out of the original {hathidata['author'].nunique()} authors, {unreconciled['author'].nunique()} remain unreconciled.")


Out of the original 6017 authors, 5988 remain unreconciled.
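
Nearly all author strings remain unreconciled after this step. One contributing factor may be that `lookup_dict` holds only the `Authorized Name` column, so a HathiTrust string that matches a variant form finds nothing. The sketch below builds a lookup from several name columns using plain dictionaries. The rows and the identifiers `DLL-0001` and `DLL-0002` are invented placeholders, not real DLL Catalog data, and `build_variant_lookup` is a hypothetical helper.

```python
NAME_COLUMNS = ["Authorized Name", "Author Name English", "Author Name Latin"]

# Invented placeholder rows that mimic the shape of variant-names.csv
rows = [
    {"DLL Identifier": "DLL-0001", "Authorized Name": "Virgil",
     "Author Name English": "Vergil", "Author Name Latin": "Publius Vergilius Maro"},
    {"DLL Identifier": "DLL-0002", "Authorized Name": "Horace",
     "Author Name English": None, "Author Name Latin": "Quintus Horatius Flaccus"},
]


def build_variant_lookup(rows, name_columns=NAME_COLUMNS):
    """Map every non-empty name form to its row's identifier.

    If two rows share a name form, the first identifier seen is kept.
    """
    lookup = {}
    for row in rows:
        for column in name_columns:
            name = row.get(column)
            # Skips None and NaN alike, since neither is a string
            if isinstance(name, str) and name.strip():
                lookup.setdefault(name.strip(), row["DLL Identifier"])
    return lookup


variant_lookup = build_variant_lookup(rows)
print(variant_lookup.get("Vergil"))  # DLL-0001
```

With the real DataFrame, `df.to_dict('records')` would yield rows in this shape, with missing values appearing as NaN instead of `None`.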


In [40]:
# List the distinct author strings that were matched to a DLL identifier
identified = hathidata[hathidata['dll_identifier'].notna()]
identified['author'].unique()

array(['Horace', 'Commodianus.', 'Livy', 'Persius', 'Phaedrus',
       'Tibullus', 'Virgil', 'Homer', 'Horace.', 'Martial', 'Apuleius.',
       'Commodianus', 'Origen', 'Aristotle', 'Quintilian', 'Martial.',
       'Persius.', 'Eugippius.', 'Apponius.', 'Apicius.', 'Juvenal',
       'Apuleius', 'Lygdamus.', 'Apponius', 'Terence', 'Gaius.',
       'Atilius Fortunatianus, active 4th century.', 'Defensor de Ligugé',
       'Atilius Fortunatianus, 4th cent.'], dtype=object)

Only 29 distinct author strings were matched, Juvenal and Livy among them. Weird.
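
For strings that still fail an exact lookup, approximate matching is one option to explore. The sketch below uses `difflib.get_close_matches` from the Python standard library. The candidate list, the `cutoff` of 0.8, and the helper name `closest_authorized_name` are illustrative assumptions, and any threshold would need checking against real records to avoid false matches.

```python
from difflib import get_close_matches

# Illustrative candidates; in practice these would be the keys of lookup_dict
authorized_names = ["Virgil", "Horace", "Livy", "Juvenal"]


def closest_authorized_name(name, candidates, cutoff=0.8):
    """Return the most similar candidate at or above `cutoff`, or None."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_authorized_name("Vergil", authorized_names))
print(closest_authorized_name("Cicero", authorized_names))
```

The comparison is case-sensitive and purely character-based, so it knows nothing about Latin versus vernacular name forms; it is best treated as a way to propose candidates for manual review.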