# Exploring HathiTrust Metadata without NLP or AI/ML

This is a Jupyter notebook for working with metadata downloaded from HathiTrust from a search with the following criteria:

    - Title: liber libri libris libro OR All Fields: opus opera operibus OR Title: carmen carmina carminibus
    - Language: (Latin)
    - Original Format: (Book)

## Working assumptions

- Some records will not be candidates for inclusion in the DLL Catalog because they are editions of Greek works. Such editions traditionally have Latin titles and introductory materials.
- Some records will not have corresponding authority and/or work files in the DLL Catalog.
- The names of many authors will be present in many variant spellings and forms.
- The titles of works will be particularly difficult to parse, since they will be along the lines of _opera omnia_, vel sim.

In [1]:
'''
author: Samuel J. Huskey
'''
# Import the necessary modules
import pandas as pd
import re

In [None]:
# Read in the tab-delimited file downloaded from Hathi and turn it into a dataframe
df = pd.read_csv('../../data/1908698974-1722799169.txt', sep='\t')

In [3]:
# Examine the basic structure of the file
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24799 entries, 0 to 24798
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   htid                     24799 non-null  object 
 1   access                   24799 non-null  int64  
 2   rights                   24799 non-null  object 
 3   ht_bib_key               24799 non-null  int64  
 4   description              10074 non-null  object 
 5   source                   24799 non-null  object 
 6   source_bib_num           24729 non-null  object 
 7   oclc_num                 17400 non-null  object 
 8   isbn                     164 non-null    object 
 9   issn                     0 non-null      float64
 10  lccn                     3208 non-null   object 
 11  title                    24799 non-null  object 
 12  imprint                  24788 non-null  object 
 13  rights_reason_code       24799 non-null  object 
 14  rights_timestamp      

There are 24,799 entries in the data. I'll see how far I can go using just methods from Python and Pandas.

## Analysis of columns

In [90]:
# Set the display options to show all columns
pd.set_option('display.max_columns', None)
# Examine the first five rows
df.head()

Unnamed: 0,htid,access,rights,ht_bib_key,description,source,source_bib_num,oclc_num,isbn,issn,lccn,title,imprint,rights_reason_code,rights_timestamp,us_gov_doc_flag,rights_date_used,pub_place,lang,bib_fmt,collection_code,content_provider_code,responsible_entity_code,digitization_agent_code,access_profile_code,author,catalog_url,handle_url
0,aeu.ark:/13960/t25b10270,1,pd,100281057,,AEU,6264341,768320676,06654768259780665476822,,,"Historiæ canadensis, seu Novæ-Franciæ libri de...",Apud Sebastianum Cramoisy et Sebast. Mabre-Cra...,bib,2014-09-17 03:25:33,0,1664,fr,lat,BK,AEU,ualberta,ualberta,ia,open,"Du Creux, François, 1596?-1666.",https://catalog.hathitrust.org/Record/100281057,https://hdl.handle.net/2027/aeu.ark:/13960/t25...
1,aeu.ark:/13960/t5q82gg1r,1,pd,100288370,,AEU,6279449,861561317,06656169299780665616921,,,Ernesti Meyer de plantis labradoricis libri tres.,"Sumtibus Leopoldi Vossii, 1830.",bib,2014-09-17 03:26:35,0,1830,gw,lat,BK,AEU,ualberta,ualberta,ia,open,"Meyer, Ernst H. F. 1791-1858.",https://catalog.hathitrust.org/Record/100288370,https://hdl.handle.net/2027/aeu.ark:/13960/t5q...
2,aeu.ark:/13960/t6155m888,1,pd,100315300,,AEU,6374963,85791860,06659401069780665940101,,,"Novus orbis, seu Descriptionis Indiae Occident...","Apud Elzevirios, 1633.",bib,2014-09-19 03:25:59,0,1633,ne,lat,BK,AEU,ualberta,ualberta,ia,open,"Laet, Joannes de, 1593-1649.",https://catalog.hathitrust.org/Record/100315300,https://hdl.handle.net/2027/aeu.ark:/13960/t61...
3,aeu.ark:/13960/t6tx4326r,1,pd,100266272,,AEU,4964437,719990409,06653521079780665352102,,,C. Julii Cæsaris commentariorum De Bello Galli...,"Armour and Ramsay, 1849.",bib,2014-09-17 03:26:53,0,1849,quc,lat,BK,AEU,ualberta,ualberta,ia,open,"Caesar, Julius",https://catalog.hathitrust.org/Record/100266272,https://hdl.handle.net/2027/aeu.ark:/13960/t6t...
4,aeu.ark:/13960/t77s8mb8n,1,pd,100312296,,AEU,6368951,867440434,"066589693X,9780665896934",,,Collectanea latina seu ecclesiasticæ antiquita...,"[s.n.], 1853.",bib,2014-09-19 03:26:14,0,1853,onc,lat,BK,AEU,ualberta,ualberta,ia,open,,https://catalog.hathitrust.org/Record/100312296,https://hdl.handle.net/2027/aeu.ark:/13960/t77...


### Columns that could be jettisoned

- `access`: its value is always "1". 
- `rights` will always be "pd" (public domain). 
- `description`: are all values "NaN"?
- `issn`: are all values "NaN"?
- `lccn`: are all values "NaN"?
- `us_gov_doc_flag`: the values should be "0"
- `lang`: the search criteria specified "lat"

I'll check which columns have multiple or single values.

In [4]:
# Use nunique() to check the number of unique values in each column
unique_values = df.nunique()

# Identify columns with only one unique value
single_value_columns = unique_values[unique_values == 1].index.tolist()

# Print the results
print("Columns with multiple values:")
print(unique_values[unique_values > 1])
print("\nColumns with a single unique value:")
print(unique_values[unique_values == 1])

Columns with multiple values:
htid                       24799
rights                         4
ht_bib_key                 14917
description                 1831
source                        35
source_bib_num             16403
oclc_num                    9336
isbn                          41
lccn                        1095
title                      14482
imprint                    13111
rights_reason_code             7
rights_timestamp            9090
rights_date_used             508
pub_place                     97
collection_code               51
content_provider_code         35
responsible_entity_code       35
digitization_agent_code       16
access_profile_code            2
author                      6017
catalog_url                14917
handle_url                 24799
dtype: int64

Columns with a single unique value:
access             1
us_gov_doc_flag    1
lang               1
bib_fmt            1
dtype: int64


I can safely jettison `access`, `us_gov_doc_flag`, `lang`, and `bib_fmt`.
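
One detail in that output is easy to miss: `issn` appears in neither list. `df.info()` reported 0 non-null values for it, and `nunique()` ignores missing values by default, so a completely empty column has zero unique values rather than one. A minimal sketch on toy data (the frame below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: one varied column, one constant column, one entirely empty column
toy = pd.DataFrame({
    "title": ["Opera", "Carmina", "Libri"],
    "lang": ["lat", "lat", "lat"],
    "issn": [np.nan, np.nan, np.nan],
})

counts = toy.nunique()  # missing values are ignored by default (dropna=True)
constant = counts[counts == 1].index.tolist()
empty = counts[counts == 0].index.tolist()

print("Constant columns:", constant)
print("Empty columns:", empty)
```

Checking for `counts == 0` as well as `counts == 1` would catch both kinds of column that carry no information.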

I'm not interested right now in `rights`, `rights_reason_code`, `rights_timestamp`, `rights_date_used`, `collection_code`, `content_provider_code`, `responsible_entity_code`, `digitization_agent_code`, or `access_profile_code`.

In fact, as long as I have one unique identifier to tie the records to the original dataframe, I can eliminate most of the columns so that I can focus on authors and titles. Both `htid` and `handle_url` have a unique value in each of the 24,799 rows; I'll use `handle_url` as the identifier.

I'll make a new dataframe with only the columns needed: `author`, `title`, `imprint`, `pub_place`, `rights_date_used` (a.k.a. publication date), and `handle_url`.
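
A quick way to confirm which columns can serve as a row identifier is the `is_unique` property of a pandas Series. This is a sketch on toy data; the handle values are hypothetical, and the repeated `ht_bib_key` only mimics the fact that the real column has fewer unique values (14,917) than there are rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "ht_bib_key": [100, 100, 200],        # repeated key, as in the real data
    "handle_url": ["h/1", "h/2", "h/3"],  # hypothetical handles
})

# Keep only the columns whose values never repeat
unique_cols = [col for col in toy.columns if toy[col].is_unique]
print(unique_cols)
```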

In [5]:
# Make a new dataframe with the required columns
hathidata = df[['author','title','imprint','pub_place','rights_date_used','handle_url']]

In [6]:
# Inspect the first five records
hathidata.head()

Unnamed: 0,author,title,imprint,pub_place,rights_date_used,handle_url
0,"Du Creux, François, 1596?-1666.","Historiæ canadensis, seu Novæ-Franciæ libri de...",Apud Sebastianum Cramoisy et Sebast. Mabre-Cra...,fr,1664,https://hdl.handle.net/2027/aeu.ark:/13960/t25...
1,"Meyer, Ernst H. F. 1791-1858.",Ernesti Meyer de plantis labradoricis libri tres.,"Sumtibus Leopoldi Vossii, 1830.",gw,1830,https://hdl.handle.net/2027/aeu.ark:/13960/t5q...
2,"Laet, Joannes de, 1593-1649.","Novus orbis, seu Descriptionis Indiae Occident...","Apud Elzevirios, 1633.",ne,1633,https://hdl.handle.net/2027/aeu.ark:/13960/t61...
3,"Caesar, Julius",C. Julii Cæsaris commentariorum De Bello Galli...,"Armour and Ramsay, 1849.",quc,1849,https://hdl.handle.net/2027/aeu.ark:/13960/t6t...
4,,Collectanea latina seu ecclesiasticæ antiquita...,"[s.n.], 1853.",onc,1853,https://hdl.handle.net/2027/aeu.ark:/13960/t77...


In [94]:
# Count the number of records. It should still be 24,799
len(hathidata)

24799

## Reconcile the Hathi Author values against the VIAF Alternates

The file `output/viaf-authors-output.csv` contains variant names for authors, one variant per row, with the DLL's author identifier in the adjacent column. This is essentially a lookup table, so I'll use it that way. I'll turn it into a Python dictionary. I'll use the Pandas `map()` method to look up an author's name from the `author` column in `hathidata`. If a value matches, I'll add the corresponding DLL ID to a new column in `hathidata`.

In [95]:
# Open the VIAF author data file
viafauthors = pd.read_csv('output/viaf-authors-output.csv')

In [96]:
# Create a dictionary from viafauthors for lookup (name variant -> DLL identifier)
lookup_dict = pd.Series(viafauthors.Identifier.values, index=viafauthors['H2 Text']).to_dict()

In [97]:
# Make a new 'dll_author_id' column in hathidata and insert the DLL's Author ID, if there's a match
hathidata = hathidata.copy()
hathidata.loc[:, 'dll_author_id'] = hathidata['author'].map(lookup_dict)
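
It helps to keep in mind that `map()` with a dictionary is an exact, whole-string lookup: anything that differs by even one character comes back as `NaN`. A minimal sketch with made-up names and hypothetical identifiers:

```python
import pandas as pd

# Hypothetical identifiers, for illustration only
lookup = {"Virgil": "A0001", "Horace": "A0002"}

authors = pd.Series(["Virgil", "Virgil.", " Horace", None])
ids = authors.map(lookup)

# Only the first value is an exact key in the dictionary
matched = int(ids.notna().sum())
print(ids)
print(f"Matched: {matched} of {len(authors)}")
```

A trailing period, a leading space, or a missing value is each enough to prevent a match.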

I'll check on how many authors in `hathidata` were matched with a DLL ID.

In [99]:
# Use the count() method to count the number of rows with a value other than "NaN"
non_nan_count = hathidata['dll_author_id'].count()
print(f"Number of records matched: {non_nan_count}")
print(f"Number of records without a match: {len(hathidata) - non_nan_count}")

Number of records matched: 6313
Number of records without a match: 18486


6,313 rows out of 24,799 were matched with a DLL ID. I still need to reconcile 18,486 records!

I'd like to examine the authors who were not reconciled, to figure out if they aren't yet in the DLL's catalog, or if some other factor prevented the match.

In [101]:
# Use the isna() method to make a dataframe of records without a value in dll_author_id
unreconciled = hathidata[hathidata['dll_author_id'].isna()]

In [102]:
# Get basic information about the new dataframe
unreconciled.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18486 entries, 0 to 24798
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   author            17522 non-null  object
 1   title             18486 non-null  object
 2   imprint           18475 non-null  object
 3   pub_place         18486 non-null  object
 4   rights_date_used  18486 non-null  int64 
 5   handle_url        18486 non-null  object
 6   dll_author_id     0 non-null      object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


I want to see who these unidentified authors are. I'll use the `unique()` method to see who they are and to get a sense of whether some of the names have multiple rows.

In [103]:
# Make a list of unique author names that don't have a DLL ID
unique_unid = unreconciled['author'].unique()

In [104]:
# How long is it?
print(f"There are {len(unique_unid)} unique author names without a DLL ID")

There are 5665 unique author names without a DLL ID


Who are these 5,665 unidentified authors?

In [105]:
for author in unique_unid:
    print(author)

Du Creux, François, 1596?-1666.
Meyer, Ernst H. F. 1791-1858.
Laet, Joannes de, 1593-1649.
nan
Drexel, Jeremias, 1581-1638,
Kircher, Athanasius, 1602-1680
Acosta, José de, 1540-1600,
Lessius, Leonardus, 1554-1623
Riccioli, Giovanni Battista, 1598-1671,
Guazzo, Francesco Maria,
Mersenne, Marin, 1588-1648,
Roothaan, Joannes Philippus, 1785-1853
Arrian.
Euclid,
Suarez, Francisco, 1548-1617.
Alvares, Manuel, 1526-1583.
Huygens, Constantijn, 1596-1687
Nieremberg, Juan Eusebio, 1595-1658.
Bellarmino, Roberto Francesco Romolo, Saint, 1542-1621.
Reiske, Johann Jacob, 1716-1774.
Suárez, Francisco, 1548-1617.
Boethius, -524.
Prosper, of Aquitaine, Saint, approximately 390-approximately 463.
Vitruvius Pollio.
Thomas, Aquinas, Saint, 1225?-1274.
Torsellino, Orazio, 1545-1599.
Juvencus, Caius Vettius Aquilinus.
Hyginus, Gromaticus.
Horace
Rostowski, Stanislaw, 1711-1784.
Valerius Flaccus, Gaius, active 1st century.
Navarrete, Jean André, 1730-1811.
Caesar, Julius.
Herodotus.
Nonius Marcellus, act

I can see that some of the authors are not authors of Latin works. For example, "Hesiod" wrote in Greek. He has been included here, no doubt, because editions of Greek works often have titles and other information in Latin. It's clear that I need to winnow out some authors.

I also see that "nan" is in the list. "nan" is short for "Not a Number", which is Pandas' way of saying "this field is empty". That could make things more complicated.
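
A small sketch of how missing values behave in the methods used here, on toy data: `unique()` keeps the missing value as one entry, while `value_counts()` leaves it out unless asked to include it.

```python
import numpy as np
import pandas as pd

authors = pd.Series(["Horace", np.nan, "Horace", np.nan, "Livy"])

uniq = authors.unique()                          # includes the missing value once
counts = authors.value_counts()                  # excludes missing values by default
counts_all = authors.value_counts(dropna=False)  # includes them
n_missing = int(authors.isna().sum())

print(len(uniq), int(counts.sum()), int(counts_all.sum()), n_missing)
```

So a count of unique names that comes from `unique()` includes the empty value, and `isna().sum()` gives the number of rows with no author at all.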

I'll start by seeing how many rows are associated with each unidentified author.

In [106]:
# Count the rows associated with each unique author
author_counts = unreconciled['author'].value_counts()
author_counts

author
Cicero, Marcus Tullius                                                                           371
Livy                                                                                             297
Horace                                                                                           196
Tacitus, Cornelius                                                                               167
Virgil                                                                                           151
Pliny, the Elder.                                                                                150
Quintilian.                                                                                      150
Thomas, Aquinas, Saint, 1225?-1274.                                                              145
Aristotle.                                                                                       138
Suarez, Francisco, 1548-1617.                                                       

Something is clearly not right. Cicero, Livy, Horace, Tacitus, and Virgil should have been matched. Why weren't they? I guess "Cicero, Marcus Tullius" is not one of the alternative name forms in the DLL Catalog's authority file for Cicero. But "Virgil" should be in there for sure, since "Virgil" is the DLL Catalog's authorized name for Publius Vergilius Maro. (I'm just going to skip the whole "Virgil" or "Vergil" discussion for now.)

I want to take a closer look at "Virgil".

In [107]:
# Examine the entries for "Virgil"
unreconciled[unreconciled['author'] == 'Virgil']


Unnamed: 0,author,title,imprint,pub_place,rights_date_used,handle_url,dll_author_id
327,Virgil,Opera Ex recensione Chri. Gottl. Heynii.,"sumptibus et typis Instituti bibliographici, 1...",gw,1830,https://hdl.handle.net/2027/chi.085043156,
328,Virgil,P. Virgilii Maronis opera. Interpretatione et ...,impensis G. et S. Ginger [etc.] 1802.,enk,1802,https://hdl.handle.net/2027/chi.085043180,
329,Virgil,Eneid (libro II.) Testo; versione e note del p...,"L. Cappelli, [1912]",it,1912,https://hdl.handle.net/2027/chi.085043554,
493,Virgil,"P. Virgilii Maronis Opera, varietate lectionis...",Typis T. Rickaby; impensis T. Payne [etc.] 1793.,enk,1793,https://hdl.handle.net/2027/chi.095720116,
495,Virgil,"P. Virgilii Maronis Opera, varietate lectionis...",Typis T. Rickaby; impensis T. Payne [etc.] 1793.,enk,1793,https://hdl.handle.net/2027/chi.095720239,
642,Virgil,"P. Virgilii Maronis Opera, varietate lectionis...",Typis T. Rickaby; impensis T. Payne [etc.] 1793.,enk,1793,https://hdl.handle.net/2027/chi.16856987,
649,Virgil,"Publii Virgilii Maronis Opera; or, The works o...","Pratt, Oakley & co., 1859.",nyu,1859,https://hdl.handle.net/2027/chi.17743747,
785,Virgil,"P. Virgilii Maronis Opera, varietate lectionis...",Typis T. Rickaby; impensis T. Payne [etc.] 1793.,enk,1793,https://hdl.handle.net/2027/chi.43057341,
858,Virgil,Opera omnia ad optimorum librorum fidem recens...,"sumptibus et typis B. G. Teubneri, 1825.",gw,1825,https://hdl.handle.net/2027/chi.51749701,
859,Virgil,Opera. Locis parallelis illustravit Joannes Ge...,"sumptibus C. F. Himburgi, 1798.",gw,1798,https://hdl.handle.net/2027/chi.51749903,


In [108]:
# Is "Virgil" in the DLL dataset?
search_string = "Virgil"
is_present_in_author_name = viafauthors['H2 Text'].str.contains(search_string).any()
print(f'The string "{search_string}" is present in the lookup table: {is_present_in_author_name}')

The string "Virgil" is present in the lookup table: True


Okay, so "Virgil" is in the lookup table. Maybe there's an issue with white space? I'll strip it from both and try again.
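
A caveat about that check: `str.contains()` looks for a substring, so it returns `True` for any entry that merely contains "Virgil". The lookup with `map()` needs an entry that equals "Virgil" exactly. A sketch of the difference, using made-up lookup entries (these are assumptions, not the actual contents of the VIAF file):

```python
import pandas as pd

# Hypothetical lookup entries
names = pd.Series(["Virgilius Maro, Publius", "Virgil."])

# Substring test: True if "Virgil" occurs anywhere in any entry
has_substring = bool(names.str.contains("Virgil", regex=False).any())

# Exact test: True only if some entry is exactly "Virgil"
has_exact = bool(names.eq("Virgil").any())

print(has_substring, has_exact)
```

An exact test such as `names.eq(...)` or `"Virgil" in lookup_dict` answers the question that matters for the dictionary lookup.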

In [109]:
# Trim whitespace from both DataFrames using `strip()`
hathidata.loc[:,'author'] = hathidata['author'].str.strip()
viafauthors.loc[:,'H2 Text'] = viafauthors['H2 Text'].str.strip()

# Rebuild the lookup dictionary from viafauthors
lookup_dict = pd.Series(viafauthors.Identifier.values, index=viafauthors['H2 Text']).to_dict()

# Map the author names to the identifiers using the lookup dictionary
hathidata = hathidata.copy()
hathidata['dll_author_id'] = hathidata['author'].map(lookup_dict)

# Use the count() method to count the number of rows with a value other than "NaN"
non_nan_count = hathidata['dll_author_id'].count()
print(f"There are {non_nan_count} records with a DLL identifier.")


There are 6313 records with a DLL identifier.


Hmmm. That's the same as the first time, so white space isn't the answer. I'll try using a "fuzzy matching" approach.

In [88]:
# Import the necessary modules
from rapidfuzz import process, fuzz

In [110]:
# Function to perform fuzzy matching
def fuzzy_match_author(author, lookup_dict, threshold=90):
    if pd.isna(author):
        return None
    match = process.extractOne(author, lookup_dict.keys(), scorer=fuzz.token_sort_ratio)
    if match and match[1] >= threshold:
        return lookup_dict[match[0]]
    return None

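# Work on an explicit copy so the assignment doesn't write to a slice of hathidata
unreconciled = unreconciled.copy()

# Apply fuzzy matching to the author names
unreconciled.loc[:,'dll_author_id'] = unreconciled['author'].apply(fuzzy_match_author, args=(lookup_dict,))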

That step takes between seven and nine minutes to run on my MacBook Pro, depending on what processes are running at the same time. Let's take a look at the author counts again. First I'll make a new dataframe of the authors who are still unreconciled, then I'll check on their numbers.
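
One possible way to shorten that run, sketched here under assumptions: the 18,486 unreconciled rows contain only 5,665 unique author strings, so each unique name could be scored once and the results mapped back onto the rows. The scorer below uses the standard library's `difflib` as a stand-in for `rapidfuzz`, the `best_match` helper is hypothetical, and the identifiers are made up.

```python
from difflib import SequenceMatcher
import pandas as pd

def best_match(name, lookup, threshold=90):
    """Return the ID of the most similar key in lookup, or None if below threshold."""
    best_key, best_score = None, 0.0
    for key in lookup:
        score = 100 * SequenceMatcher(None, name, key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return lookup[best_key] if best_score >= threshold else None

# Hypothetical lookup table and author column
lookup = {"Virgil": "A0001", "Cicero, Marcus Tullius": "A0003"}
authors = pd.Series(["Virgil.", "Virgil.", "Cicero, Marcus Tullius", None, "Plato"])

# Score each distinct name once, then map the cached results onto every row
unique_names = authors.dropna().unique()
cache = {name: best_match(name, lookup) for name in unique_names}
ids = authors.map(cache)

print(ids)
```

The same pattern should work with `fuzzy_match_author` in place of `best_match`; how much time it saves depends on how many rows share an author string.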

In [111]:
# Make a new dataframe of the authors still without a match
still_unreconciled = unreconciled[unreconciled['dll_author_id'].isna()].copy()

# Count the rows associated with each unique author
still_unreconciled['author'].value_counts()

author
Livy                                                                                             297
Thomas, Aquinas, Saint, 1225?-1274.                                                              145
Plato.                                                                                           105
Thucydides.                                                                                       86
Xenophon.                                                                                         79
Statius, P. Papinius                                                                              64
Pindar.                                                                                           63
Herodotus                                                                                         60
Livio, Tito, ca. 60-17 a. C.                                                                      51
Tomás de Aquino, Santo, 1225?-1274                                                  

Virgil and some of the others are gone, but Livy is still there, along with many others. Also, I'm shocked and ashamed to say that Thomas Aquinas doesn't have an authority record in the DLL Catalog! The Greek authors are also still in the mix. 

On closer inspection, I noticed that several names are followed by a period, and that's preventing them from being matched, even with fuzzy matching. Interesting! I'll see what happens if I strip the final period from the "author" field.

I'll use the `rstrip()` method to remove the terminal period.
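
To see why one extra character can matter, here is a rough illustration with the standard library's `difflib` (a stand-in; `rapidfuzz` scores may differ somewhat). With a ratio-style score, a stray period costs a short name proportionally more than a long one, so a short name can fall below a threshold of 90 while a long one stays above it.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale, using difflib's ratio as a stand-in scorer."""
    return 100 * SequenceMatcher(None, a, b).ratio()

short = similarity("Livy.", "Livy")             # 4 shared characters out of 9 total
long = similarity("Quintilian.", "Quintilian")  # 10 shared characters out of 21 total

print(f"{short:.1f} {long:.1f}")
```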

In [113]:
# Use `rstrip()` to remove the terminal period from author names
still_unreconciled.loc[:,'author'] = still_unreconciled['author'].str.rstrip('.')

# How many records are there in still_unreconciled?
print(f"There are {len(still_unreconciled)} records in the still_unreconciled dataframe.")

There are 13092 records in the still_unreconciled dataframe.
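
A side effect to be aware of: `rstrip('.')` removes every trailing period, including ones that belong to an abbreviation or an initial. The sample values below are assumptions about what the unstripped names presumably looked like.

```python
import numpy as np
import pandas as pd

authors = pd.Series([
    "Plato.",
    "Ackermann, Petrus F.",
    "Abbatius, Baldus Angelus, 16th cent.",
    np.nan,
])

# Missing values pass through untouched; all trailing periods are removed
stripped = authors.str.rstrip(".")
print(stripped.tolist())
```

That is harmless for matching as long as the lookup keys are stripped the same way, but the stripped forms are no longer the names as HathiTrust recorded them.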


I'd like to see what happens if I use fuzzy matching on the "still_unreconciled" dataframe now.

In [114]:
# Make a new dataframe called "fuzzied2"
fuzzied2 = still_unreconciled.copy()

# Apply fuzzy matching to the author names
fuzzied2.loc[:,'dll_author_id'] = fuzzied2['author'].apply(fuzzy_match_author, args=(lookup_dict,))

In [115]:
print(f"There are {len(fuzzied2)} records in the fuzzied2 dataframe.")

There are 13092 records in the fuzzied2 dataframe.


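That check can't tell me anything: `len()` counts rows, and fuzzy matching fills in `dll_author_id` without adding or removing rows, so the length is 13,092 either way. Counting the non-null values in `fuzzied2['dll_author_id']` would show whether the second pass matched any names.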
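
A minimal sketch, on toy data with a hypothetical identifier, of measuring a matching pass by counting filled-in identifiers instead of rows:

```python
import numpy as np
import pandas as pd

before = pd.DataFrame({
    "author": ["Livy", "Juvenal", "Plato"],
    "dll_author_id": [np.nan, np.nan, np.nan],
})

# Pretend a matching pass found one identifier (hypothetical value)
after = before.assign(dll_author_id=[np.nan, "A0004", np.nan])

same_length = len(before) == len(after)
newly_matched = int(after["dll_author_id"].count() - before["dll_author_id"].count())

print(same_length, newly_matched)
```

The row count is identical before and after, while `count()` on the identifier column shows the one new match.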

What about removing the Greek authors? I can use the output of `.value_counts()` to build a list of names to filter.

I'll start by finding out how many authors have more than 50 items associated with them.

In [117]:
# Examine the number of rows per author after the terminal period has been stripped and fuzzy matching applied
still_unreconciled['author'].value_counts()

author
Livy                                                                                            297
Thomas, Aquinas, Saint, 1225?-1274                                                              151
Thucydides                                                                                      114
Plato                                                                                           105
Xenophon                                                                                        103
Herodotus                                                                                        96
Pindar                                                                                           86
Tomás de Aquino, Santo, 1225?-1274                                                               67
Juvenal                                                                                          66
Statius, P. Papinius                                                                         

In [118]:
# Define the threshold for filtering
threshold = 50

# Get the value counts for the 'author' column
value_counts = still_unreconciled['author'].value_counts()

# Get the authors with more than 'threshold' occurrences
more_than_fifty = value_counts[value_counts > threshold].index

# Filter the original DataFrame to include only rows with these authors
unreconciled_more_than_fifty = still_unreconciled[still_unreconciled['author'].isin(more_than_fifty)]

# Print the result to the screen 
unreconciled_more_than_fifty['author'].value_counts()

author
Livy                                  297
Thomas, Aquinas, Saint, 1225?-1274    151
Thucydides                            114
Plato                                 105
Xenophon                              103
Herodotus                              96
Pindar                                 86
Tomás de Aquino, Santo, 1225?-1274     67
Juvenal                                66
Statius, P. Papinius                   64
Philo, of Alexandria                   63
Aristophanes                           52
Livio, Tito, ca. 60-17 a. C            51
Name: count, dtype: int64

What about authors with between 10 and 50 items?

In [119]:
# Define the range for filtering
lower_bound = 10
upper_bound = 50

# Get the value counts for the 'author' column
value_counts = still_unreconciled['author'].value_counts()

# Get the authors with counts within the specified range
in_range = value_counts[(value_counts >= lower_bound) & (value_counts <= upper_bound)].index

# Filter the original DataFrame to include only rows with these authors
more_than_10 = still_unreconciled[still_unreconciled['author'].isin(in_range)]

pd.set_option('display.max_rows', None)
# Print the result to the screen
more_than_10['author'].value_counts()

author
Benedict XIV, Pope, 1675-1758                                 49
Demosthenes                                                   48
Amat de Graveson, Ignace Hyacinthe (O.P.), 1670-1733          48
John Chrysostom, Saint, -407                                  47
Athenaeus, of Naucratis                                       46
Menochio, Giacomo                                             43
Bellarmino, Roberto Francesco Romolo, Saint, 1542-1621        43
Euclid                                                        42
Denis the Carthusian, 1402-1471                               40
Petau, Denis, 1583-1652                                       39
Florus, Lucius Annaeus                                        37
Diodorus, Siculus                                             37
Barbosa, Agostino, 1590-1649                                  37
Catholic Church                                               36
Reiske, Johann Jacob, 1716-1774                               35
Galeno            

Okay: Authors with between 2 and 9 items.

In [120]:
# Define the range for filtering
lower_bound = 2
upper_bound = 9

# Get the value counts for the 'author' column
value_counts = still_unreconciled['author'].value_counts()

# Get the authors with counts within the specified range
in_range = value_counts[(value_counts >= lower_bound) & (value_counts <= upper_bound)].index

# Filter the original DataFrame to include only rows with these authors
fewer_than_10 = still_unreconciled[still_unreconciled['author'].isin(in_range)]

pd.set_option('display.max_rows', None)
# Print the result to the screen
fewer_than_10['author'].value_counts()

author
Coccejus, Johannes, 1603-1669                                                         9
Mayans y Siscar, Gregorio, 1699-1781                                                  9
Benedicto XIV, Papa, 1675-1758                                                        9
Turrettini, François, 1623-1687                                                      9
Bretschneider, Karl Gottlieb, 1776-1848                                               9
Llera, Matías de                                                                      9
Devarius, Matthaeus, b. 1505?                                                         9
Fernández de Retes, José                                                              9
Orosio, Paulo, n. 390?-m. 418?                                                        9
Jesuitas                                                                              9
Lomm, Joost van                                                                       9
Herodian                 
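
The last three cells repeat the same steps with different bounds, so they could be folded into one small helper. This is a sketch: the function name `authors_in_range` is hypothetical, and both bounds are inclusive, so "more than 50" would be written as a lower bound of 51.

```python
import pandas as pd

def authors_in_range(df, lower, upper=None, column="author"):
    """Return value counts for names whose row count lies within [lower, upper]."""
    counts = df[column].value_counts()
    mask = counts >= lower
    if upper is not None:
        mask &= counts <= upper
    return counts[mask]

# Toy data: 5 rows for Livy, 3 for Juvenal, 1 for Plato
toy = pd.DataFrame({"author": ["Livy"] * 5 + ["Juvenal"] * 3 + ["Plato"]})

mid = authors_in_range(toy, 2, 4)
top = authors_in_range(toy, 5)

print(mid.to_dict(), top.to_dict())
```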

I think it's time to do a manual review of the list of unreconciled authors. I'll print every unique name in the `still_unreconciled` dataframe in alphabetical order so that I can pick out the Greek authors.

In [121]:
# Make a list of the unreconciled author names
unreconciled_authors_list = still_unreconciled['author'].to_list()
# Use set() to eliminate duplicates
unreconciled_authors = set(unreconciled_authors_list)
# Make a list from the set
unique_unreconciled_authors = list(unreconciled_authors)
# Make sure that each name is actually a string
string_names = [str(name) for name in unique_unreconciled_authors]
# Sort the list alphabetically
sorted_names = sorted(string_names)
for name in sorted_names:
    print(name)

Abad, Diego José, 1727-1779
Abadía de Santillana del Mar
Abati, Baldo Angelo
Abaunza, Pedro de 1599-1649
Abbatius, Baldus Angelus, 16th cent
Abbeloos, J. B. 1836-1896
Abbeloos, Jean Baptiste, 1836-1906
Abdias, Obispo de Babilonia
Abicht, Rudolf, 1850-1921
Abrahams, Nicolai Christian Levin, 1798-1870
Abril, Pedro Simón, ca. 1530- ca. 1595
Abu al-Faraj al-Isbahani, 897 or 8-967
Abū Miḥjan al-Thaqafī, active 629-637
Abū Miḥjan al-Thaqafī, fl. 629-637
Abū Tammām Ḥabīb ibn Aws al-Ṭāʼī, active 808-842
Abū Tammām Ḥabīb ibn Aws al-Ṭāʾī, fl. 808-842,
Abū al-Rabīʻ Sulaymān ibn ʻAbd Allāh al-Muwaḥḥid
Abū ʻUbayd al-Qāsim ibn Sallām, approximately 773-approximately 837
Abū al-Faraj al-Iṣbahānī, 897 or 898-967
Academia Molshemensis (Francia)
Accademia degli Occulti (Brescia)
Acevedo, Alfonso de, 1518-1598
Achilles Tatius
Achillini, Alessandro
Achillini, Alessandro, 1463-1512
Acidalius, Valens, 1567-1595
Ackermann, Johann Christian Gottlieb, 1756-1801
Ackermann, Petrus F

After going through that list manually (the process took about an hour), I have assembled the following list of authors to be omitted from the data:

- Agathias, d. 582
- Alciphron
- Anacreon
- Apollodorus
- Apollonius, Dyscolus, 2nd cent
- Apollonius, Dyscolus, active 2nd century
- Apollonius, Rhodius
- Apolodoro de Atenas
- Archimedes
- Aristophanes
- Arrian
- Arriano, Flavio
- Artemidoro
- Athenaeus, of Naucratis
- Bacchylides
- Bion, of Phlossa near Smyrna
- Cassius Dio Cocceianus
- Cleomedes
- Constantine VII Porphyrogenitus, Emperor of the East, 905-959
- Cyril, Saint, Bishop of Jerusalem, approximately 315-386
- Demosthenes
- Dio Chrysostomus
- Diodorus, Siculus
- Diogenes Laertius
- Dion Casio
- Dionisio de Halicarnaso, ca. 60-5 a.C
- Dionysius Cisterciensis
- Dionysius, of Halicarnassus
- Diógenes Laercio
- Dión Casio
- Elias, of Nisibis, 975-1046
- Epictetus
- Euclid
- Euclides
- Euripides
- Eusebio de Cesarea, Obispo de Cesarea, ca. 265-ca. 340
- Eusebius, of Caesarea, Bishop of Caesarea, ca. 260-ca. 340.
- Eustathius, Macrembolites, 12th cent
- Galen
- Galeno
- Gregory, of Nazianzus, Saint
- Gregory, of Nyssa, Saint, ca. 335-ca. 394
- Hero of Alexandria
- Herodian
- Herodotus
- Heródoto, 484-425 a. C
- Hesiod
- Iamblichus, approximately 250-approximately 330
- Iamblichus, ca. 250-ca. 330
- Irenaeus, Saint, Bishop of Lyon
- Isocrates
- John Chrysostom, Saint, d. 407
- John VI Cantacuzenus, Emperor of the East, 1292-1383
- Juliano, Emperador de Roma, 331-363
- Justin, Martyr, Saint
- Justino, Santo, 100?-165?
- Libanius
- Lydus, Ioannes Laurentius, 490-
- Methodius, of Olympus, Saint, -311
- Michael, of Ephesus
- Nicander, of Colophon
- Nicephorus Callistus, ca. 1256-1335
- Nicephorus, Blemmydes, 1197-1272
- Orpheus
- Philo, of Alexandria
- Pindar
- Pindarus
- Plato
- Platón, ca. 427-348 a.C
- Plotinus
- Plutarch
- Polyaenus
- Procopius
- Píndaro, ca. 518-ca. 438 a. C
- Quintus, Smyrnaeus, 4th cent
- Sappho
- Sextus, Empiricus
- Simplicio
- Simplicius, of Cilicia
- Stobaeus
- Strabo
- Temistio
- Teodoreto, Obispo de Ciro
- Teofrasto
- Themistius
- Theocritus
- Theon, of Smyrna
- Theophilus, Saint, active 2nd century
- Theophrastus
- Thucydides
- Tryphiodorus
- Tucídides, ca. 460-ca. 400 a. C
- Xenophon
- Xenophon, of Ephesus
- Yamblico

I'll make that into a list and use it to filter the `still_unreconciled` dataframe.

In [122]:
# Provide the raw list of names as a string
names_to_be_omitted = """Agathias, d. 582
Alciphron
Anacreon
Apollodorus
Apollonius, Dyscolus, 2nd cent
Apollonius, Dyscolus, active 2nd century
Apollonius, Rhodius
Apolodoro de Atenas
Archimedes
Aristophanes
Arrian
Arriano, Flavio
Artemidoro
Athenaeus, of Naucratis
Bacchylides
Bion, of Phlossa near Smyrna
Cassius Dio Cocceianus
Cleomedes
Constantine VII Porphyrogenitus, Emperor of the East, 905-959
Cyril, Saint, Bishop of Jerusalem, approximately 315-386
Demosthenes
Dio Chrysostomus
Diodorus, Siculus
Diogenes Laertius
Dion Casio
Dionisio de Halicarnaso, ca. 60-5 a.C
Dionysius Cisterciensis
Dionysius, of Halicarnassus
Diógenes Laercio
Dión Casio
Elias, of Nisibis, 975-1046
Epictetus
Euclid
Euclides
Euripides
Eusebio de Cesarea, Obispo de Cesarea, ca. 265-ca. 340
Eusebius, of Caesarea, Bishop of Caesarea, ca. 260-ca. 340.
Eustathius, Macrembolites, 12th cent
Galen
Galeno
Gregory, of Nazianzus, Saint
Gregory, of Nyssa, Saint, ca. 335-ca. 394
Hero of Alexandria
Herodian
Herodotus
Heródoto, 484-425 a. C
Hesiod
Iamblichus, approximately 250-approximately 330
Iamblichus, ca. 250-ca. 330
Irenaeus, Saint, Bishop of Lyon
Isocrates
John Chrysostom, Saint, d. 407
John VI Cantacuzenus, Emperor of the East, 1292-1383
Juliano, Emperador de Roma, 331-363
Justin, Martyr, Saint
Justino, Santo, 100?-165?
Libanius
Lydus, Ioannes Laurentius, 490-
Methodius, of Olympus, Saint, -311
Michael, of Ephesus
Nicander, of Colophon
Nicephorus Callistus, ca. 1256-1335
Nicephorus, Blemmydes, 1197-1272
Orpheus
Philo, of Alexandria
Pindar
Pindarus
Plato
Platón, ca. 427-348 a.C
Plotinus
Plutarch
Polyaenus
Procopius
Píndaro, ca. 518-ca. 438 a. C
Quintus, Smyrnaeus, 4th cent
Sappho
Sextus, Empiricus
Simplicio
Simplicius, of Cilicia
Stobaeus
Strabo
Temistio
Teodoreto, Obispo de Ciro
Teofrasto
Themistius
Theocritus
Theon, of Smyrna
Theophilus, Saint, active 2nd century
Theophrastus
Thucydides
Tryphiodorus
Tucídides, ca. 460-ca. 400 a. C
Xenophon
Xenophon, of Ephesus
Yamblico"""

# Split the string by line breaks to create a list
names_to_omit_list = names_to_be_omitted.splitlines()

# Make a new dataframe without those authors
no_greek = still_unreconciled[~still_unreconciled['author'].isin(names_to_omit_list)]

In [124]:
# How many records (rows) were eliminated by removing those Greek authors?
print(f"That operation eliminated {len(still_unreconciled) - len(no_greek)} items.")

That operation eliminated 1403 items.
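
Because `isin` matches exact strings only, a name in the omit list that differs from the `author` column by even a trailing period removes nothing. A minimal sketch of a sanity check follows. It uses a small invented dataframe in place of `still_unreconciled`, and the names `toy`, `toy_omit`, `unmatched_names`, and `removed` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for still_unreconciled; these rows are invented for illustration
toy = pd.DataFrame({
    "author": ["Plato", "Pindar", "Vergil", "Plato", "Eusebius, of Caesarea"],
    "title": ["Opera", "Carmina", "Opera", "Dialogi", "Opera"],
})

# Toy omit list; the last entry has a trailing period, so it cannot match exactly
toy_omit = ["Plato", "Pindar", "Eusebius, of Caesarea."]

# Names in the omit list that never occur in the author column
present = set(toy["author"].dropna())
unmatched_names = [name for name in toy_omit if name not in present]

# Number of rows that the isin() filter would actually remove
removed = int(toy["author"].isin(toy_omit).sum())

print(f"Omit-list names with no exact match: {unmatched_names}")
print(f"Rows removed by the filter: {removed}")
```

Any name reported as unmatched is worth checking against the `author` column by hand before trusting the count of eliminated rows.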


In [125]:
# Compare the lengths of the dataframes
print(f"Length of original HathiTrust dataframe: {len(hathidata)}")
print(f"Length of dataframe with no matches: {len(still_unreconciled)}")
print(f"Length of dataframe without Greek authors: {len(no_greek)}")


Length of original HathiTrust dataframe: 24799
Length of dataframe with no matches: 13092
Length of dataframe without Greek authors: 11689


11,689 is still a lot of records to work through manually.

What do I know at this point?

1. The DLL Catalog lacks some common names (e.g., "Livy", "Juvenal") for authors who already have authority files
2. The DLL Catalog has some embarrassing gaps (e.g., Aquinas)
3. There's a very long list of "singletons", or authors with only one record

After I've added the names in items 1 and 2 to the DLL Catalog, there will still be thousands of unreconciled records. There's not much to be done about item 3. Those records will have to be reviewed and added to the catalog, if appropriate.
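
For the common names in item 1, one option is to add them to the lookup dictionary as aliases of identifiers it already contains. This is only a sketch, assuming the lookup dictionary maps name strings to DLL identifiers. The dictionary `toy_lookup`, the authorized name forms, and the identifiers are placeholders, not real DLL Catalog values.

```python
# Placeholder lookup dictionary; real DLL names and identifiers will differ
toy_lookup = {
    "Livius, Titus": "DLL-PLACEHOLDER-1",
    "Iuvenalis, Decimus Iunius": "DLL-PLACEHOLDER-2",
}

# Common names mapped to the (assumed) authorized form already in the dictionary
aliases = {
    "Livy": "Livius, Titus",
    "Juvenal": "Iuvenalis, Decimus Iunius",
}

# Add each alias under the identifier of its authorized form,
# without overwriting any name that is already present
for common_name, authorized_name in aliases.items():
    if authorized_name in toy_lookup:
        toy_lookup.setdefault(common_name, toy_lookup[authorized_name])

print(toy_lookup["Livy"])
```

Using `setdefault` means an existing entry is never replaced, so the sketch is safe to rerun.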

But I should at least generate a final version of the original data with as many reconciled rows as possible. To do that, I'll run all of the operations together on a copy of the original `hathidata` dataframe.

In [126]:
# Make a copy of the hathidata dataframe
hathidata_copy = hathidata.copy()
# Trim whitespace from the "author" column
hathidata_copy.loc[:,'author'] = hathidata_copy['author'].str.strip()
# Trim terminal period from the "author" column
hathidata_copy.loc[:,'author'] = hathidata_copy['author'].str.rstrip('.')
# Match as many records as possible without fuzzy matching
hathidata_copy.loc[:, 'dll_author_id'] = hathidata_copy['author'].map(lookup_dict)
# Use the count() method to count the number of rows with a value other than "NaN",
# i.e., the rows that now have a DLL identifier
non_nan_count = hathidata_copy['dll_author_id'].count()
# Print the result to the screen
print(f"There are {non_nan_count} rows with a DLL identifier after straight matching.")
# Create a mask for rows where dll_author_id is NaN or an empty string
mask = hathidata_copy['dll_author_id'].isna() | (hathidata_copy['dll_author_id'] == '')
# Apply the fuzzy_match_author function only to the rows where the mask is True
hathidata_copy.loc[mask, 'dll_author_id'] = hathidata_copy.loc[mask, 'author'].apply(fuzzy_match_author, args=(lookup_dict,))
# Use the count() method to count the number of rows with a value other than "NaN"
non_nan_count = hathidata_copy['dll_author_id'].count()
# Print the result to the screen
print(f"There are {non_nan_count} rows with a DLL identifier after straight and fuzzy matching.")

There are 4332 rows with a DLL identifier after straight matching.
There are 11196 rows with a DLL identifier after straight and fuzzy matching.
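
The steps in that cell can also be wrapped in a function so they run the same way on any copy of the data. In this minimal sketch, `toy_fuzzy_match` is a hypothetical stand-in built on the standard library's `difflib`; it is not the notebook's `fuzzy_match_author`. The function name `reconcile`, the cutoff of 0.85, and the toy data are likewise assumptions for illustration.

```python
import difflib
import pandas as pd


def toy_fuzzy_match(name, lookup, cutoff=0.85):
    """Hypothetical stand-in for fuzzy_match_author, built on difflib."""
    if not isinstance(name, str):
        return None
    close = difflib.get_close_matches(name, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


def reconcile(frame, lookup):
    """Clean the author column, then match exactly and, failing that, fuzzily."""
    out = frame.copy()
    # Trim whitespace, then any terminal period
    out["author"] = out["author"].str.strip().str.rstrip(".")
    # Exact matching against the lookup dictionary
    out["dll_author_id"] = out["author"].map(lookup)
    # Fuzzy matching only for rows that are still unmatched
    mask = out["dll_author_id"].isna()
    out.loc[mask, "dll_author_id"] = out.loc[mask, "author"].apply(
        toy_fuzzy_match, args=(lookup,)
    )
    return out


# Invented names and identifiers, for illustration only
toy_lookup = {"Vergilius Maro, Publius": "A1", "Ovidius Naso, Publius": "A2"}
toy = pd.DataFrame(
    {"author": ["Vergilius Maro, Publius. ", "Ovidius Naso, Publius,", "Plato", None]}
)
result = reconcile(toy, toy_lookup)

# count()/notna() give rows WITH an identifier; isna() gives rows WITHOUT one
matched = int(result["dll_author_id"].notna().sum())
unmatched = int(result["dll_author_id"].isna().sum())
print(f"Rows with an identifier: {matched}; rows without: {unmatched}")
```

Reporting both numbers side by side makes it harder to confuse the rows that have an identifier with the rows that lack one.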


Okay, those are counts of rows. But how many authors still lack a DLL identifier?

I'll make a new dataframe of rows without a DLL identifier, then I'll use `nunique()` to count the authors.

In [130]:
# Make the new dataframe
unreconciled = hathidata_copy[hathidata_copy['dll_author_id'].isna()]
# Use nunique() to count the unique values in the "author" column
print(f"Out of the original {hathidata_copy['author'].nunique()} authors, {unreconciled['author'].nunique()} remain unreconciled.")

Out of the original 5365 authors, 4513 remain unreconciled.
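
Since review effort depends on authors rather than rows, it may help to see which unreconciled authors account for the most records. A minimal sketch, with an invented dataframe standing in for `unreconciled`; the names `toy_unreconciled`, `records_per_author`, and `top_two_share` are assumptions for illustration.

```python
import pandas as pd

# Invented rows standing in for the unreconciled dataframe
toy_unreconciled = pd.DataFrame({
    "author": ["Thomas, Aquinas, Saint", "Thomas, Aquinas, Saint",
               "Thomas, Aquinas, Saint", "Livy", "Livy", "Anonymus"],
    "title": ["Summa", "Opera omnia", "Opera", "Historiae", "Opera", "Carmen"],
})

# Count records per author, most frequent first
records_per_author = toy_unreconciled["author"].value_counts()

# Share of all unreconciled records covered by the two most frequent authors
top_two_share = records_per_author.head(2).sum() / len(toy_unreconciled)

print(records_per_author)
print(f"Top two authors cover {top_two_share:.0%} of the records.")
```

Authors at the top of such a list are the ones whose addition to the catalog would reconcile the most rows at once.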


That seems like a bad result. But how many of those are "singletons"? 

In [139]:
# Group by author and count unique titles
author_title_counts = unreconciled.groupby('author')['title'].nunique()

# Filter authors with only one title
authors_with_one_title = author_title_counts[author_title_counts == 1]

# Count the number of such authors
count_of_authors_with_one_title = authors_with_one_title.count()
print(f"Out of {unreconciled['author'].nunique()} unreconciled authors, {count_of_authors_with_one_title} are 'singletons'.")
print(f"Percentage of unreconciled authors that are singletons: {count_of_authors_with_one_title/unreconciled['author'].nunique():.1%}.")

Out of 4513 unreconciled authors, 3344 are 'singletons'.
Percentage of unreconciled authors that are singletons: 74.1%.


That's not terrible, I guess. About 74% of the unreconciled authors are singletons, which means that the odds were pretty good that they weren't in the catalog in the first place.
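
One caveat: a "singleton" was defined earlier as an author with only one record, while the cell above counts authors with only one unique title. The two can differ when the same title appears in several records. A minimal sketch of the difference, using an invented dataframe (`toy`, `one_record`, and `one_title` are illustrative names):

```python
import pandas as pd

# Invented data: author A has two records that share a single title
toy = pd.DataFrame({
    "author": ["A", "A", "B", "C", "C"],
    "title": ["Opera", "Opera", "Carmina", "Opera", "Epistolae"],
})

grouped = toy.groupby("author")

# Definition 1: authors with exactly one record (row)
one_record = grouped.size() == 1

# Definition 2: authors with exactly one unique title
one_title = grouped["title"].nunique() == 1

print(f"Singletons by record count: {int(one_record.sum())}")
print(f"Singletons by unique title: {int(one_title.sum())}")
```

Which definition is more useful depends on whether repeated titles should count as separate items to review.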