# Analysis of Output from Hybrid Approach

In [1]:
# Import the Pandas library
import pandas as pd

# Read in the output data
df = pd.read_csv('../output/output_df.csv',encoding='utf-8',quotechar='"')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13491 entries, 0 to 13490
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   author                       13491 non-null  object 
 1   deterministic_author         4339 non-null   object 
 2   fuzzy_author                 5674 non-null   object 
 3   fuzzy_author_score           13491 non-null  float64
 4   distilbert_author            13491 non-null  object 
 5   distilbert_author_score      13491 non-null  float64
 6   title                        13491 non-null  object 
 7   matched_title_deterministic  13491 non-null  object 
 8   matched_title_fuzzy          13491 non-null  object 
 9   fuzzy_title_score            13491 non-null  float64
dtypes: float64(3), object(7)
memory usage: 1.0+ MB


## First impressions

Just looking at the output of the `info()` method, I see that there are 13,491 total rows in the dataframe. That's down from over 20,000 in the original, thanks to the deduplication I was able to do in the `prepare_hathi.ipynb` notebook. 

Some of the columns do not have data in all of the rows. Specifically, `deterministic_author` has values in only 4,339 of the 13,491 rows, and `fuzzy_author` in 5,674. That suggests that the deterministic method is the most "conservative" approach, with the fuzzy matching approach coming in second in that regard.
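To put numbers on that impression, here is a minimal sketch that turns the non-null counts reported by `info()` into shares of the whole dataframe. The counts are copied by hand from the output above; the variable names are only for illustration.

```python
# Counts copied from the df.info() output above.
total_rows = 13491
non_null = {
    "deterministic_author": 4339,
    "fuzzy_author": 5674,
}

# Share of rows for which each method returned a candidate.
coverage = {column: count / total_rows for column, count in non_null.items()}

for column, share in coverage.items():
    print(f"{column}: {share:.1%} of rows have a match")
```

That works out to roughly 32% of rows for the deterministic method and 42% for the fuzzy method. With the dataframe loaded, `df.notna().mean()` should report the same shares for every column at once.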

In [2]:
df.describe()

| | fuzzy_author_score | distilbert_author_score | fuzzy_title_score |
|---|---|---|---|
| count | 13491.0 | 13491.0 | 13491.0 |
| mean | 0.415911 | 0.790084 | 0.151044 |
| std | 0.488386 | 0.298975 | 0.326466 |
| min | 0.0 | 0.082294 | 0.0 |
| 25% | 0.0 | 0.568069 | 0.0 |
| 50% | 0.0 | 0.999766 | 0.0 |
| 75% | 1.0 | 0.999998 | 0.0 |
| max | 1.0 | 1.0 | 0.9 |


The `describe()` method only covers two author matching methods, since it summarizes numeric columns and the deterministic method has no score column, but it tells its own story. In previous analyses, the fuzzy matching method appeared to be slightly more liberal, with an 86.04% matching average, while the DistilBERT model had an 82.29% average. Now, with fewer records to begin with, the mean fuzzy author score is much lower: 41.6%. The mean DistilBERT matching score is also a little lower: 79%.

More revealing is the quartile output. The fuzzy matching method reports a score of 0 at the 25th and 50th percentiles, then jumps to 100% at the 75th. That's a big change from previous attempts, when the fuzzy method registered an 85.5% match at the 25th percentile and the DistilBERT model returned 68.9% at that level. Now the DistilBERT method achieves only 56.8% at the 25th percentile, but it rises dramatically at the 50th percentile, to 99.97%.

I attribute the difference to the fact that I raised the cutoff values for the fuzzy matching and DistilBERT methods.
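The shape of the fuzzy quartiles (0, 0, then 1) is what you would expect whenever more than half of the rows have a score of 0. Here is a small sketch with synthetic scores, not the real data, that assumes unmatched rows get 0.0 and, as a simplification, that every matched row gets 1.0:

```python
from statistics import quantiles

# Synthetic scores, NOT the real data: 58 unmatched rows (score 0.0)
# and 42 matched rows (score 1.0), roughly mirroring the 42% match share.
scores = [0.0] * 58 + [1.0] * 42

# Quartile cut points, with linear interpolation between data points.
q1, median, q3 = quantiles(scores, n=4, method="inclusive")
print(q1, median, q3)
```

Because the zeros fill more than the bottom half of the sorted scores, the first quartile and the median both land on 0, and only the third quartile reaches the matched rows.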

In [3]:
df.head()

| | author | deterministic_author | fuzzy_author | fuzzy_author_score | distilbert_author | distilbert_author_score | title | matched_title_deterministic | matched_title_fuzzy | fuzzy_title_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Du Creux, François, 1596?-1666. | | | 0.0 | {'authorized_name': 'cruz, luís da, 1543-1604... | 0.467436 | Historiæ canadensis, seu Novæ-Franciæ libri de... | Unknown | Unknown | 0.0 |
| 1 | Meyer, Ernst H. F. 1791-1858. | | | 0.0 | {'authorized_name': 'meyer, wilhelm, 1845-1917... | 0.999939 | Ernesti Meyer de plantis labradoricis libri tres. | Unknown | Unknown | 0.0 |
| 2 | Laet, Joannes de, 1593-1649. | | | 0.0 | {'authorized_name': 'larroumet, gustave', 'aut... | 0.494394 | Novus orbis, seu Descriptionis Indiae Occident... | Unknown | Unknown | 0.0 |
| 3 | Caesar, Julius | | {'authorized_name': 'caesar, julius', 'author_... | 0.96 | {'authorized_name': 'caesar, julius', 'author_... | 0.999999 | C. Julii Cæsaris commentariorum De Bello Galli... | Unknown | Unknown | 0.0 |
| 4 | Unknown | | | 0.0 | {'authorized_name': 'stephanus abbas 4. or 6th... | 0.177454 | Collectanea latina seu ecclesiasticæ antiquita... | Unknown | Unknown | 0.0 |


In [4]:
# Get the number of unique values in the `author` column
df['author'].nunique()

5556

That number hasn't changed from previous analyses, so we'll still see how well the different methods performed on authors unknown to them.

Note that this does not mean that there are 5,556 unique authors in the dataframe. Rather, there are 5,556 unique name forms. For example, "Virgil", "Virgil,", "Virgil." (note the punctuation marks in the last two), "Virgile (0070-0019 av. J.-C.).", and "Virgilio Marón, Publio" all refer to the same person.
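To illustrate why simple cleanup only goes part of the way, here is a minimal sketch that groups the Virgil name forms by a normalized key. The `normalize_name_form` helper is hypothetical (it is not part of the matching pipeline) and only lowercases the name and strips trailing punctuation.

```python
import re
from collections import defaultdict

def normalize_name_form(name: str) -> str:
    """Hypothetical normalizer: casefold and strip trailing punctuation/whitespace."""
    return re.sub(r"[\s.,;:]+$", "", name.strip()).casefold()

name_forms = [
    "Virgil",
    "Virgil,",
    "Virgil.",
    "Virgile (0070-0019 av. J.-C.).",
    "Virgilio Marón, Publio",
]

# Group the raw name forms under their normalized key.
groups = defaultdict(list)
for form in name_forms:
    groups[normalize_name_form(form)].append(form)

for key, forms in groups.items():
    print(f"{key!r}: {forms}")
```

The three punctuation variants collapse into one key, but the French and Spanish forms stay separate. Those can only be reconciled by matching against authorized name records, which is what the methods analyzed below attempt.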

## Deterministic Author Matching

I want to see where this method reported a match.

In [6]:
# Make a dataframe of just the author and deterministic author columns
deterministic = df[['author','deterministic_author']]
# Filter out the NA values
deterministic = deterministic[deterministic['deterministic_author'].notna()]
deterministic.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4339 entries, 6 to 13488
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   author                4339 non-null   object
 1   deterministic_author  4339 non-null   object
dtypes: object(2)
memory usage: 101.7+ KB


In [7]:
# Count the number of unique values in the deterministic_author column
print(f"There are {deterministic['deterministic_author'].nunique()} unique authors in the deterministic dataframe.")

There are 444 unique authors in the deterministic dataframe.


There are 444 unique values in the `deterministic_author` column. That's the same number that I found in `analysis-2.ipynb`, which was twelve more than in `analysis.ipynb`.

I'm going to make a CSV file so that I can investigate the matches more easily. I'll sort the `deterministic` dataframe by the `author` column first, then save the sorted dataframe as a CSV file.

In [8]:
sorted_deterministic = deterministic.sort_values(axis='index',by='author')
# Copy the result so that a column can be added later without a SettingWithCopyWarning
sorted_deterministic_deduped = sorted_deterministic.drop_duplicates(subset='author').copy()

In [9]:
import csv
sorted_deterministic_deduped.to_csv('../output/deterministic_author.csv',index=False,quoting=csv.QUOTE_ALL)

As expected, every match achieved by the deterministic method was 100% accurate. On the other hand, it matched only 444 of the 5,556 unique values in the `author` column, or about 8%. But that doesn't mean that there are 5,112 unmatched individual authors. Rather, there are 5,112 unmatched author name forms. Unfortunately, we won't know how many unique authors those 5,112 unmatched name forms belong to. We won't know that until we've successfully matched as many of them as possible with their authorized name forms.

Just for the sake of having the information, I'll inspect the author name forms that were deterministically matched. Since the `deterministic_author` column has dictionary values (e.g., `{'authorized_name': 'virgil', 'author_id': 'A4830'}`), I'll need to process the column by retrieving just the authorized name form.

In [17]:
import ast
sorted_deterministic_deduped.loc[:,'deterministic_author_name'] = sorted_deterministic_deduped['deterministic_author'].apply(lambda x: ast.literal_eval(x)['authorized_name'] if isinstance(x, str) else x['authorized_name'])
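Since the same extraction is needed again for the `fuzzy_author` column, the lambda could also be written as a named helper. This is a sketch under the assumption that each cell is a dictionary, the string form of one, or missing; the `authorized_name` function name is my own.

```python
import ast

def authorized_name(value):
    """Return the 'authorized_name' from a dict or its string form, else None."""
    if isinstance(value, dict):
        return value.get("authorized_name")
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # The string is not a Python literal (e.g. a bare word).
            return None
        return parsed.get("authorized_name") if isinstance(parsed, dict) else None
    # NaN (a float) and None fall through to here.
    return None

print(authorized_name("{'authorized_name': 'virgil', 'author_id': 'A4830'}"))
```

It could then be passed straight to `apply()`, for example `df['deterministic_author'].apply(authorized_name)`, and unlike the lambda it would not raise an error on empty cells.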

In [18]:
for author in sorted_deterministic_deduped['deterministic_author_name'].unique():
    print(author)

abbo, monk of st. germain, approximately 850-approximately 923
abelard, peter
acosta, josé de, 1540-1600
agricola, georg, 1494-1555
agricola, rodolphus, 1443?-1485
agrippa von nettesheim, heinrich cornelius, 1486?-1535
agustín, antonio, 1517-1586
ailly, pierre d', 1350-1420
alanus, de insulis
alberti, leon battista, 1404-1472
albertus, magnus, saint, 1193?-1280
albertus, de saxonia, -1390
alciati, andrea, 1492-1550
aldhelm, saint
aldrovandi, ulisse, 1522-1605?
alexander, of hales, approximately 1185-1245
alexander, de villa dei
alfred, of sareshel
alvares, manuel, 1526-1583
ambrose, saint, bishop of milan
amelli, ambrogio, 1848-1933
ammianus marcellinus
ampelius, lucius
andreas, capellanus
andreä, johann valentin, 1586-1654
angèli, pietro, 1517-1596
anselm, saint, archbishop of canterbury
antonius, marcus, 83 b.c.?-30 b.c.
apponius
arator, subdiaconus
arnobius, of sicca
astruc, jean, 1684-1766
augurelli, giovanni aurelio, approximately 1456-1524?
augustine, of hippo, saint, 354-430

It's quite a mix of Classical, Medieval, and Neo-Latin authors!

I'll remove these authors from the analysis of results from the fuzzy matching and DistilBERT matching, since they can be considered 100% matched.

In [26]:
# Use the unique() method to make a list of deterministically matched authors.
deterministically_matched_authors = sorted_deterministic_deduped['deterministic_author_name'].unique().tolist()
# Confirm that the result is a list
type(deterministically_matched_authors)

list

## Fuzzy Author Matching

In [34]:
# Make a dataframe of the fuzzy matching columns and the author column
fuzzy = df[['author','fuzzy_author','fuzzy_author_score']]
# Eliminate any NaN cells; copy so that a column can be added later without a SettingWithCopyWarning
fuzzy = fuzzy[fuzzy['fuzzy_author'].notna()].copy()
fuzzy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5674 entries, 3 to 13488
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   author              5674 non-null   object 
 1   fuzzy_author        5674 non-null   object 
 2   fuzzy_author_score  5674 non-null   float64
dtypes: float64(1), object(2)
memory usage: 177.3+ KB


In [35]:
# Count the number of unique values in the fuzzy_author column
fuzzy['fuzzy_author'].nunique()

555

The fuzzy matching method returned 555 unique values (down from 1,081 in `analysis.ipynb`, but up from 551 in `analysis-2.ipynb`), compared to the 5,556 unique values in the `author` column.

I'll remove the authors that we already know were deterministically matched so that we can examine what value, if any, the fuzzy matching algorithm added. To do that I need to get the `authorized_name` value from the dictionary in the `fuzzy_author` column.

In [36]:
# Get the authorized name values
fuzzy.loc[:,'fuzzy_author_name'] = fuzzy['fuzzy_author'].apply(lambda x: ast.literal_eval(x)['authorized_name'] if isinstance(x, str) else x['authorized_name'])
# Remove the deterministically matched authors
fuzzy = fuzzy[~fuzzy['fuzzy_author_name'].isin(deterministically_matched_authors)]
fuzzy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 527 entries, 7 to 13482
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   author              527 non-null    object 
 1   fuzzy_author        527 non-null    object 
 2   fuzzy_author_score  527 non-null    float64
 3   fuzzy_author_name   527 non-null    object 
dtypes: float64(1), object(3)
memory usage: 20.6+ KB


That removed all but 527 entries!

Now let's see which authors weren't matched at all.

In [39]:
unmatched_fuzzy = df[df['fuzzy_author'].isna()]
# Show the number of unique author values in the unmatched_fuzzy dataframe
display(unmatched_fuzzy['author'].nunique())
# Display the list of unmatched author names
unmatched_authors_list = sorted(unmatched_fuzzy['author'].unique())
for author in unmatched_authors_list:
    print(author)


4527

Abad, Diego José, 1727-1779
Abadía de Santillana del Mar.
Abati, Baldo Angelo
Abaunza, Pedro de 1599-1649.
Abbatius, Baldus Angelus, 16th cent.
Abbeloos, J. B. 1836-1896.
Abbeloos, Jean Baptiste, 1836-1906.
Abicht, Rudolf, 1850-1921.
Abrahams, Nicolai Christian Levin, 1798-1870.
Abril, Pedro Simón, ca. 1530- ca. 1595.
Abu al-Faraj al-Isbahani, 897 or 8-967.
Abū Miḥjan al-Thaqafī, active 629-637
Abū Tammām Ḥabīb ibn Aws al-Ṭāʼī, active 808-842
Abū Tammām Ḥabīb ibn Aws al-Ṭāʾī, fl. 808-842,
Abū al-Rabīʻ Sulaymān ibn ʻAbd Allāh al-Muwaḥḥid.
Abū ʻUbayd al-Qāsim ibn Sallām, approximately 773-approximately 837
Abū al-Faraj al-Iṣbahānī, 897 or 898-967.
Academia Molshemensis (Francia)
Accademia degli Occulti (Brescia)
Acevedo, Alfonso de, 1518-1598
Achillini, Alessandro
Achillini, Alessandro, 1463-1512.
Acidalius, Valens, 1567-1595
Ackermann, Johann Christian Gottlieb, 1756-1801
Ackermann, Petrus Fouerius, 1771-1831
Aconcio, Iacopo, -1566.
Actuarius, Johannes.
Adam, 

A lot of the authors in that list do not look familiar to me, so it is likely that they simply do not yet have records in the DLL's catalog. On the other hand, several of the names *should have been matched*. For example, "Apicius." and "Apuleius." should have been matched, but the trailing period appears to have foiled the fuzzy matching routine. I'll fix that function and try again.
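Here is a rough sketch of how a trailing period can pull a similarity score below a cutoff. It uses `difflib.SequenceMatcher` from the standard library as a stand-in, which is an assumption on my part and not the scorer used in the matching pipeline; `strip_trailing_punctuation` is likewise a hypothetical cleanup step.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; not the pipeline's actual scorer."""
    return SequenceMatcher(None, a, b).ratio()

def strip_trailing_punctuation(name: str) -> str:
    """Hypothetical cleanup step: drop trailing periods, commas, and spaces."""
    return name.rstrip(" .,;:")

authorized = "apicius"
raw = "Apicius."

# Score the raw form, then the cleaned form, against the authorized name.
before = similarity(raw.casefold(), authorized)
after = similarity(strip_trailing_punctuation(raw).casefold(), authorized)
print(f"before: {before:.3f}, after: {after:.3f}")
```

On a name this short, one extra character costs several points, which could be enough to fall under a strict cutoff. One caveat: stripping trailing periods also touches forms that end in an initial or a date, so it would be worth checking a sample of the results after the fix.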