# Analysis of Output from Hybrid Approach

In [45]:
# Import the Pandas library
import pandas as pd

# Read in the output data
df = pd.read_csv('../data/output_df.csv',encoding='utf-8',quotechar='"')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24799 entries, 0 to 24798
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   author                   24799 non-null  object 
 1   deterministic_author     9303 non-null   object 
 2   fuzzy_author             22465 non-null  object 
 3   fuzzy_author_score       24799 non-null  float64
 4   distilbert_author        24799 non-null  object 
 5   distilbert_author_score  24799 non-null  float64
 6   title                    24799 non-null  object 
 7   deterministic_title      7932 non-null   object 
 8   fuzzy_title              23933 non-null  object 
 9   fuzzy_title_score        24799 non-null  float64
dtypes: float64(3), object(7)
memory usage: 1.9+ MB


## First Impressions

Just looking at the output of the `info()` method, I see that there are 24,799 total rows in the dataframe. Some of the columns do not have data in all of the rows. Specifically, `deterministic_author` (9,303 non-null), `fuzzy_author` (22,465), `deterministic_title` (7,932), and `fuzzy_title` (23,933) have empty cells. That suggests that the deterministic method is by far the most "conservative" approach, with the fuzzy matching approach a distant second in that regard.
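
The non-null counts can be turned into match rates directly. Here is a minimal sketch on a small stand-in frame; the toy values are invented for illustration, and only the column names come from the `info()` output above.

```python
import pandas as pd

# Toy stand-in for output_df; the values are made up for illustration.
toy = pd.DataFrame({
    "author": ["Caesar, Julius", "Unknown", "Juvenal.", "Homer"],
    "deterministic_author": [None, None, "juvenal", None],
    "fuzzy_author": ["caesar, julius", None, "juvenal", "homer"],
})

# Share of rows in which each method returned a match
match_rate = toy[["deterministic_author", "fuzzy_author"]].notna().mean()
print(match_rate)
```

Going by the counts reported above, the same calculation on the real dataframe would come to roughly 37.5% for `deterministic_author` (9,303 of 24,799) and 90.6% for `fuzzy_author` (22,465 of 24,799).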

In [46]:
df.describe()

Unnamed: 0,fuzzy_author_score,distilbert_author_score,fuzzy_title_score
count,24799.0,24799.0,24799.0
mean,84.393939,0.820866,0.82649
std,27.896863,0.283701,0.157406
min,0.0,0.082294,0.0
25%,85.5,0.679279,0.855
50%,90.0,0.999989,0.855
75%,100.0,0.999999,0.855
max,100.0,1.0,0.917241


The `describe()` method only covers the three score columns, but it tells its own story. Note that `fuzzy_author_score` is on a 0-100 scale, while `distilbert_author_score` and `fuzzy_title_score` run from 0 to 1. On average, the fuzzy matching method appears to be slightly more liberal, with a mean author score of 84.39%, while the DistilBERT model had an 82.09% average. More revealing is the quartile output: at the 25th percentile, the fuzzy method reports an 85.5% match, while the DistilBERT model reports only 67.9%. On the other hand, at the median the DistilBERT model is the more confident of the two (99.99% versus 90%), and by the 75th percentile both methods are at or near 100%.
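
Since the two author scores are on different scales, it may help to rescale one before comparing them side by side. The following is a sketch on toy values, not the real data; the column name `fuzzy_author_score_01` is a hypothetical name introduced here.

```python
import pandas as pd

# Toy scores; values are illustrative, not taken from the real data.
scores = pd.DataFrame({
    "fuzzy_author_score": [85.5, 96.0, 0.0, 100.0],
    "distilbert_author_score": [0.47, 0.99, 0.18, 1.0],
})

# Put the fuzzy score on the same 0-1 scale as the DistilBERT score
scores["fuzzy_author_score_01"] = scores["fuzzy_author_score"] / 100

# Summarize both columns on the common scale
comparison = scores[["fuzzy_author_score_01", "distilbert_author_score"]].describe()
print(comparison.loc[["mean", "25%", "50%", "75%"]])
```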

In [47]:
df.head()

Unnamed: 0,author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score,title,deterministic_title,fuzzy_title,fuzzy_title_score
0,"Du Creux, François, 1596?-1666.",,"{'authorized_name': 'mirk, john, active 1403?'...",85.5,"{'authorized_name': 'cruz, luís da, 1543-1604...",0.467436,"Historiæ canadensis, seu Novæ-Franciæ libri de...",,"{'dll_id_work': 'W10626', 'dll_id_author': 'A5...",0.855
1,"Meyer, Ernst H. F. 1791-1858.",,"{'authorized_name': 'vopiscus, flavius', 'auth...",85.5,"{'authorized_name': 'meyer, wilhelm, 1845-1917...",0.999939,Ernesti Meyer de plantis labradoricis libri tres.,,"{'dll_id_work': 'W4469', 'dll_id_author': 'A49...",0.855
2,"Laet, Joannes de, 1593-1649.",,"{'authorized_name': 'herryson, joannes', 'auth...",85.5,"{'authorized_name': 'larroumet, gustave', 'aut...",0.494394,"Novus orbis, seu Descriptionis Indiae Occident...",,"{'dll_id_work': 'W10655', 'dll_id_author': 'A3...",0.855
3,"Caesar, Julius",,"{'authorized_name': 'caesar, julius', 'author_...",96.0,"{'authorized_name': 'caesar, julius', 'author_...",0.999999,C. Julii Cæsaris commentariorum De Bello Galli...,,"{'dll_id_work': 'W5389', 'dll_id_author': 'A33...",0.855
4,Unknown,,,0.0,{'authorized_name': 'stephanus abbas 4. or 6th...,0.177454,Collectanea latina seu ecclesiasticæ antiquita...,,"{'dll_id_work': 'W10631', 'dll_id_author': 'A3...",0.855


In [48]:
# Get the number of unique values in the `author` column
df['author'].nunique()

6018

## Dealing with Greek Names

It turns out that a lot of the values in the `author` column are Greek names. That's because critical editions of Greek texts are often published with Latinized forms of the authors' names and of the titles. That actually makes the results a little better.

**Conclusion**: I should remove the Greek authors before doing any analysis.

Fortunately, I have a list of Greek author names that I pulled from the TLG and the Perseus Project. I'll use those to reduce the number of records in the dataframe.

In [49]:
# Read in Greek names
greek = pd.read_csv('../data/greek.csv',encoding='utf-8')


This is from a previous project. The `Latin` column has `1` for "Latin" and `0` for "Greek". I'll filter it down to just those with the "Greek" classification.

I'll get a sorted list of the unique values from `Authorized Name` and use that to filter the `output_df` dataframe.
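
A minimal sketch of that filter, assuming the `Latin` column holds the integers `0` and `1` as described. The toy rows are invented for illustration, and `greek_only` and `greek_names` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for greek.csv; rows are illustrative.
greek_toy = pd.DataFrame({
    "Authorized Name": ["Homer", "Vergil", "Plato", "Homer"],
    "Latin": [0, 1, 0, 0],
})

# Keep only the rows classified as Greek (Latin == 0)
greek_only = greek_toy[greek_toy["Latin"] == 0]

# Sorted list of the distinct Greek names
greek_names = sorted(greek_only["Authorized Name"].unique())
print(greek_names)
```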

In [51]:
greek_list = sorted(greek['Authorized Name'].unique())
print(f"Number of unique authors in the Greek list: {len(greek_list)}")
# Use `greek_list` to filter the original `output_df` dataframe
latin_only = df[~df['author'].isin(greek_list)]
# Get the number of unique values in the original output_df dataframe
print(f"Number of unique values in the original dataframe: {df['author'].nunique()}")
# Get the number of unique values in the latin_only dataframe
print(f"Number of unique values in the latin_only dataframe: {latin_only['author'].nunique()}")

Number of unique authors in the Greek list: 2210
Number of unique values in the original dataframe: 6018
Number of unique values in the latin_only dataframe: 5939


In [52]:
removed_authors = df[df['author'].isin(greek_list)]
removed_authors_list = sorted(removed_authors['author'].unique())
# Check the list of authors removed
for author in removed_authors_list:
    print(author)

Aelian, 3rd cent.
Aeschylus
Alexander, of Aphrodisias
Apollonius, Rhodius
Apollonius, of Perga
Archimedes
Aretaeus, of Cappadocia
Aristides Quintilianus
Aristophanes
Aristotle
Arrian
Artemidorus, Daldianus
Athenaeus, of Naucratis
Atilius Fortunatianus, 4th cent.
Ausonius, Decimus Magnus
Bacchylides
Bion, of Phlossa near Smyrna
Cassius Dio Cocceianus
Celsus, Aulus Cornelius
Charisius, Flavius Sosipater.
Colluthus, of Lycopolis
Cyril, Saint, Bishop of Jerusalem, approximately 315-386
Demetrius, of Phaleron, b. ca. 350 B.C.
Demosthenes
Dictys, Cretensis
Diodorus, Siculus
Diogenes Laertius
Dionysius, of Halicarnassus
Diophantus, of Alexandria
Dioscorides Pedanius, of Anazarbos
Dositheus, Magister
Ephraem, Syrus, Saint, 303-373
Epictetus
Epiphanius, Saint, Bishop of Constantia in Cyprus, approximately 310-403
Euripides
Firmicus Maternus, Julius.
Galen
Gregory, of Nazianzus, Saint
Harpocration, Valerius
Heliodorus, of Emesa
Heraclitus, of Ephesus
Herodotus
Hesiod
Hippocrates
Homer
Irenaeus, 

In [54]:
# Check the remaining authors for any Greek names that were missed.
for author in sorted(latin_only['author'].unique()):
    print(author)

Abad, Diego José, 1727-1779
Abadía de Santillana del Mar.
Abati, Baldo Angelo
Abaunza, Pedro de 1599-1649.
Abbatius, Baldus Angelus, 16th cent.
Abbeloos, J. B. 1836-1896.
Abbeloos, Jean Baptiste, 1836-1906.
Abbo, Monk of St. Germain, approximately 850-approximately 923.
Abdias, Obispo de Babilonia.
Abelard, Peter, 1079-1142.
Abicht, Rudolf, 1850-1921.
Abrahams, Nicolai Christian Levin, 1798-1870.
Abril, Pedro Simón, ca. 1530- ca. 1595.
Abu al-Faraj al-Isbahani, 897 or 8-967.
Abū Miḥjan al-Thaqafī, active 629-637
Abū Miḥjan al-Thaqafī, fl. 629-637.
Abū Tammām Ḥabīb ibn Aws al-Ṭāʼī, active 808-842
Abū Tammām Ḥabīb ibn Aws al-Ṭāʾī, fl. 808-842,
Abū al-Rabīʻ Sulaymān ibn ʻAbd Allāh al-Muwaḥḥid.
Abū ʻUbayd al-Qāsim ibn Sallām, approximately 773-approximately 837
Abū al-Faraj al-Iṣbahānī, 897 or 898-967.
Academia Molshemensis (Francia)
Accademia degli Occulti (Brescia)
Acevedo, Alfonso de, 1518-1598
Achilles Tatius
Achilles Tatius.
Achillini, Alessandro
Achillin

This is bad. There are still lots of Greek authors in the mix. 

**Conclusion**: I need to train another model to identify Greek author names.

In [79]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Load the fine-tuned Greek/Latin classifier and its tokenizer
greek_model_path = '../greek'
greek_model = DistilBertForSequenceClassification.from_pretrained(greek_model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(greek_model_path)

# Set the label mapping explicitly: 0 = "Greek", 1 = "Latin"
greek_model.config.id2label = {0: "Greek", 1: "Latin"}
greek_model.config.label2id = {"Greek": 0, "Latin": 1}

# Save the updated model configuration back to the directory it was loaded from
greek_model.save_pretrained(greek_model_path)

In [84]:
import torch
import torch.nn.functional as F

def classify_author_language(input_author):
    """Classify the author's language as Greek or Latin using the fine-tuned model."""
    # Non-string values (e.g. NaN) cannot be tokenized
    if not isinstance(input_author, str):
        return "Unknown", 0.0

    # Tokenize and encode the input author name
    inputs = tokenizer(input_author, return_tensors="pt", truncation=True, padding=True)

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        logits = greek_model(**inputs).logits.cpu()

    # Apply softmax to logits to get probabilities
    probabilities = F.softmax(logits, dim=-1).numpy()
    predicted_class = int(probabilities.argmax(axis=-1)[0])
    confidence = float(probabilities.max())  # Highest probability

    # Use the id2label mapping for label names
    predicted_label = greek_model.config.id2label[predicted_class]
    print(f"Language Classification: {input_author} -> {predicted_label} (Confidence: {confidence:.4f})")

    return predicted_label, confidence

def classify_and_split_by_language(processed_df):
    """Classify authors as Greek or Latin and split the dataframe."""
    language_results = []

    for _, row in processed_df.iterrows():
        input_author = row["author"]

        # Perform language classification
        author_language, language_confidence = classify_author_language(input_author)

        # Add the results to a new row
        language_results.append({
            **row,  # Include existing row data
            "language": author_language,
            "language_confidence": language_confidence,
        })

    # Create an updated dataframe with language classification
    classified_df = pd.DataFrame(language_results)

    # Split the dataframe into Greek and Latin subsets
    greek_df = classified_df[classified_df["language"] == "Greek"].reset_index(drop=True)
    latin_df = classified_df[classified_df["language"] == "Latin"].reset_index(drop=True)

    return classified_df, greek_df, latin_df


In [85]:
import csv
# Classify authors by language and split into Greek/Latin dataframes
classified_df, greek_df, latin_df = classify_and_split_by_language(df)

# Save the results if needed
classified_df.to_csv("../data/classified_metadata.csv", index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)
greek_df.to_csv("../data/greek_authors.csv", index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)
latin_df.to_csv("../data/latin_authors.csv", index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)

# Display the results
print("Classified DataFrame:")
print(classified_df.head())
print("\nGreek Authors DataFrame:")
print(greek_df.head())
print("\nLatin Authors DataFrame:")
print(latin_df.head())

Language Classification: Du Creux, François, 1596?-1666. -> Latin (Confidence: 1.0000)
Language Classification: Meyer, Ernst H. F. 1791-1858. -> Latin (Confidence: 1.0000)
Language Classification: Laet, Joannes de, 1593-1649. -> Latin (Confidence: 1.0000)
Language Classification: Caesar, Julius -> Latin (Confidence: 1.0000)
Language Classification: Unknown -> Greek (Confidence: 0.9993)
Language Classification: Drexel, Jeremias, 1581-1638, -> Latin (Confidence: 1.0000)
Language Classification: Kircher, Athanasius, 1602-1680 -> Latin (Confidence: 1.0000)
Language Classification: Drexel, Jeremias, 1581-1638, -> Latin (Confidence: 1.0000)
Language Classification: Drexel, Jeremias, 1581-1638, -> Latin (Confidence: 1.0000)
Language Classification: Drexel, Jeremias, 1581-1638, -> Latin (Confidence: 1.0000)
Language Classification: Drexel, Jeremias, 1581-1638, -> Latin (Confidence: 1.0000)
Language Classification: Hincmar, Archbishop of Reims, approximately 806-882 -> Latin (Confidence: 1.0000

It took 18m 37.0s to classify the authors as "Latin" or "Greek".
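
Most of that time probably goes to classifying the same names over and over: there are only 6,018 unique `author` values in 24,799 rows. One possible speed-up is to classify each distinct name once and map the result back onto the rows. The sketch below uses a hypothetical stand-in classifier (`fake_classifier`) so that it runs without the model; in the notebook, `classify_author_language` would be passed instead. The helper name `classify_unique_authors` is also an assumption.

```python
import pandas as pd

def classify_unique_authors(frame, classify, column="author"):
    """Classify each distinct name once, then map the results onto every row."""
    unique_names = frame[column].dropna().unique()
    results = {name: classify(name) for name in unique_names}

    out = frame.copy()
    # Values that were never classified (e.g. missing names) fall back to ("Unknown", 0.0)
    pairs = [results.get(name, ("Unknown", 0.0)) for name in out[column]]
    out["language"] = [label for label, _ in pairs]
    out["language_confidence"] = [score for _, score in pairs]
    return out

# Hypothetical stand-in classifier that records how often it is called
calls = []
def fake_classifier(name):
    calls.append(name)
    return ("Greek", 0.9) if name.startswith("Homer") else ("Latin", 0.8)

toy = pd.DataFrame({"author": ["Homer", "Caesar, Julius", "Homer", None, "Homer"]})
classified = classify_unique_authors(toy, fake_classifier)
print(classified)
print(f"Classifier calls: {len(calls)}")
```

On the real data this would mean about 6,018 model calls instead of 24,799, roughly a quarter of the work, assuming the per-call cost dominates the run time.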

Let's inspect the Greek and Latin sets.

In [90]:
print(f"Number of rows in the Greek dataframe: {len(greek_df)}")
print(f"Number of rows in the Latin dataframe: {len(latin_df)}")

print(f"The classification step removed {len(classified_df)-len(latin_df)} records")



Number of rows in the Greek dataframe: 4088
Number of rows in the Latin dataframe: 20711
The classification step removed 4088 records


Let's inspect the content of the Greek dataframe.

In [91]:
for author in greek_df['author'].unique():
    print(author)

Unknown
Arrian.
Euclid,
Herodotus.
Celsus, Aulus Cornelius.
Nonius Marcellus, active 4th century.
Juvenal.
Claudianus, Claudius
Anacreon
John Chrysostom, Saint, -407
Clement, of Alexandria, Saint, approximately 150-approximately 215
Herodotus
Nemesius, Bp. of Emesa
Pindar
Theocritus.
Xenophon
Monachus, Haymarus, Patriarch of Jerusalem, d. 1202.
Catherine, of Alexandria, Saint.
Aristophanes.
Index librorum prohibitorum.
Homer
Epiphanius, Saint, Bishop of Constantia in Cyprus, approximately 310-403.
Euripides
Philo, of Alexandria
Plato.
Plato
Plotinus
Demosthenes
Plutarch.
Homer.
Theophrastus.
Thucydides
Epictetus
Origen
Celsus, Aulus Cornelius
Epiphanius, Saint, Bishop of Constantia in Cyprus, approximately 310-403
Aristotle
Archimedes.
Dioscorides Pedanius, of Anazarbos
Nicomachus, of Gerasa.
Proclus, approximately 410-485
Xenophon, of Ephesus
Nonius Marcellus, active 4th century
Gregory, of Nyssa, Saint, approximately 335-approximately 394.
Theocritus
Sextus, Empiricus
Philostratus, t

Weird. Juvenal was classified as a Greek author! I should move him back to the Latin author dataframe.

In [94]:
juvenal_list = ['Juvenal.','Juvenal','Juvenalis, D.J.']
juvenal = greek_df[greek_df['author'].astype('str').isin(juvenal_list)]
juvenal

Unnamed: 0,author,deterministic_author,fuzzy_author,fuzzy_author_score,distilbert_author,distilbert_author_score,title,deterministic_title,fuzzy_title,fuzzy_title_score,language,language_confidence
8,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,D. Ivni Ivvenalis Satvrarvm libri V; edited wi...,,"{'dll_id_work': 'W5389', 'dll_id_author': 'A33...",0.855,Greek,0.999857
108,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,D. Junii Juvenalis Saturarum libri v. mit erkl...,"{'dll_id_work': 'W5075', 'dll_id_author': 'A36...","{'dll_id_work': 'W5389', 'dll_id_author': 'A33...",0.855,Greek,0.999857
233,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,D. Iunii Iuvenalis Satirarum libri quinque. Ac...,,"{'dll_id_work': 'W5389', 'dll_id_author': 'A33...",0.855,Greek,0.999857
239,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,D. Iuni Iuvenalis Saturarum libri V / edited w...,"{'dll_id_work': 'W5075', 'dll_id_author': 'A36...","{'dll_id_work': 'W5389', 'dll_id_author': 'A33...",0.855,Greek,0.999857
312,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,Ivnii Ivvenalis Aqvinatis Satyrographi opvs : ...,,"{'dll_id_work': 'W10631', 'dll_id_author': 'A3...",0.855,Greek,0.999857
...,...,...,...,...,...,...,...,...,...,...,...,...
3966,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,D. Iunii Iuvenalis Satirarum libri quinque : a...,,"{'dll_id_work': 'W10631', 'dll_id_author': 'A3...",0.855,Greek,0.999857
4022,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,Saturarum libri V cum scholiis veteribus / rec...,"{'dll_id_work': 'W5075', 'dll_id_author': 'A36...","{'dll_id_work': 'W10655', 'dll_id_author': 'A3...",0.855,Greek,0.999857
4036,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,D. Iunii Iuvenalis Satirarum libri quinque. Ac...,,"{'dll_id_work': 'W5389', 'dll_id_author': 'A33...",0.855,Greek,0.999857
4080,Juvenal.,"{'authorized_name': 'juvenal', 'author_id': 'A...","{'authorized_name': 'juvenal', 'author_id': 'A...",100.0,"{'authorized_name': 'juvenal', 'author_id': 'A...",0.999999,La Satira prima di D. Giunio Giovenale / comme...,,"{'dll_id_work': 'W1590', 'dll_id_author': 'A30...",0.855,Greek,0.999857
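
Moving those rows back could look something like the following. This is a sketch on toy frames rather than the real `greek_df` and `latin_df`; the names `moved`, `latin_fixed`, and `greek_fixed` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for greek_df and latin_df; rows are illustrative.
greek_toy = pd.DataFrame({
    "author": ["Homer", "Juvenal.", "Plato", "Juvenal"],
    "language": ["Greek", "Greek", "Greek", "Greek"],
})
latin_toy = pd.DataFrame({
    "author": ["Caesar, Julius"],
    "language": ["Latin"],
})

juvenal_list = ["Juvenal.", "Juvenal", "Juvenalis, D.J."]
is_juvenal = greek_toy["author"].isin(juvenal_list)

# Relabel the misclassified rows and append them to the Latin frame
moved = greek_toy[is_juvenal].assign(language="Latin")
latin_fixed = pd.concat([latin_toy, moved], ignore_index=True)

# Keep everything else in the Greek frame
greek_fixed = greek_toy[~is_juvenal].reset_index(drop=True)

print(latin_fixed)
print(greek_fixed)
```

Note that a `language_confidence` column, if present, would still carry the model's original score for the moved rows, so it may be worth overwriting or flagging it as a manual correction.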


## Deterministic Author Matching

I want to see where this method reported a match.

In [5]:
# Make a dataframe of just the author and deterministic author columns
deterministic = df[['author','deterministic_author']]
# Filter out the NA values
deterministic = deterministic[deterministic['deterministic_author'].notna()]
deterministic.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9303 entries, 6 to 24796
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   author                9303 non-null   object
 1   deterministic_author  9303 non-null   object
dtypes: object(2)
memory usage: 218.0+ KB


In [6]:
# Count the number of unique values in the deterministic_author column
deterministic['deterministic_author'].nunique()

461

There are 461 unique values in the `deterministic_author` column.

I'm going to make a CSV file so that I can investigate the matches more easily. I'll sort the `deterministic` dataframe by the `author` column first, then save the sorted dataframe as a CSV file.

In [7]:
sorted_deterministic = deterministic.sort_values(axis='index',by='author')

In [8]:
import csv
sorted_deterministic.to_csv('../data/deterministic_author.csv',index=False,quoting=csv.QUOTE_ALL)

As expected, every match achieved by the deterministic method was 100% accurate. On the other hand, it matched only 491 of the 6,018 unique values in the `author` column.
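
Several spellings in `author` can map to the same authorized name, so the number of distinct matched inputs can differ from the number of distinct values in `deterministic_author`. A sketch of how both counts could be computed, using invented toy rows and hypothetical variable names:

```python
import pandas as pd

# Toy stand-in for the author/deterministic_author columns
toy = pd.DataFrame({
    "author": ["Juvenal.", "Juvenal", "Homer", "Caesar, Julius", "Juvenal."],
    "deterministic_author": ["juvenal", "juvenal", None, "caesar, julius", "juvenal"],
})

matched = toy[toy["deterministic_author"].notna()]

# Distinct input names that received a match
matched_inputs = matched["author"].nunique()
# Distinct authorized names they were matched to
matched_targets = matched["deterministic_author"].nunique()
# Distinct input names overall
all_inputs = toy["author"].nunique()

print(f"{matched_inputs} of {all_inputs} unique author values matched "
      f"{matched_targets} authorized names")
```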

## Fuzzy Author Matching

In [9]:
# Make a dataframe of the fuzzy matching columns and the author column
fuzzy = df[['author','fuzzy_author','fuzzy_author_score']]
# Eliminate any NaN cells
fuzzy = fuzzy[fuzzy['fuzzy_author'].notna()]

fuzzy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22465 entries, 0 to 24797
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   author              22465 non-null  object 
 1   fuzzy_author        22465 non-null  object 
 2   fuzzy_author_score  22465 non-null  float64
dtypes: float64(1), object(2)
memory usage: 702.0+ KB


In [10]:
# Count the number of unique values in the fuzzy_author column
fuzzy['fuzzy_author'].nunique()

1140

The fuzzy matching method returned 1,140 unique values compared to the 6,018 unique values in the `author` column.

Let's see the rows where there isn't a match.

In [11]:
unmatched_fuzzy = df[df['fuzzy_author'].isna()]
# Show the number of unique author values in the unmatched_fuzzy dataframe
display(unmatched_fuzzy['author'].nunique())
# Display the list of unmatched author names
authors_list = unmatched_fuzzy['author'].to_list()
set(authors_list)


467

{'Academia Molshemensis (Francia)',
 'Achilles Tatius',
 'Achilles Tatius.',
 'Acidalius, Valens, 1567-1595',
 'Aeschines.',
 'Aeschylus',
 'Aesop',
 'Aesop.',
 'Agathias, -582',
 'Agathias, -582.',
 'Albertini, Hannibal.',
 'Albumasar.',
 'Alciphron.',
 'Alegambe, Philippo, 1592-1652',
 'Alvisi, Edoardo, 1850-1915.',
 'Amama, Sixtinus, 1593-1629.',
 'Amati, Pasquale,',
 'Anacreon',
 'Anacreon.',
 'Anacreonte.',
 'Andreantonelli, Sebastiano.',
 'Anhalt, Ottocar.',
 'Apollodorus.',
 'Apollonius, Paradoxographus.',
 'Apollonius, Rhodius',
 'Apollonius, Rhodius.',
 'Apollonius, paradoxographus.',
 'Appendix Vergiliana.',
 'Archimedes',
 'Archimedes.',
 'Aristophanes',
 'Aristophanes.',
 'Aristófanes.',
 'Arngrímur Jónsson, 1568-1648.',
 'Artemidoro.',
 'Artephius,',
 'Artephius.',
 'Asclepiadeus, Androphilus.',
 'Auerbach, Bertrand, 1856-1942',
 'Augenio, Orazio',
 'Augenio, Orazio.',
 'Autolycus.',
 'Bacchylides',
 'Bacchylides.',
 'Bar Hebraeus, 1226-1286.',
 'Barbato, Orazio.',
 'Bar

## Deterministic Matching, Again

In [37]:
# Make a dataframe of just the author and deterministic author columns
deterministic = latin_only[['author','deterministic_author']]
# Filter out the NA values
deterministic = deterministic[deterministic['deterministic_author'].notna()]
deterministic.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9150 entries, 6 to 24796
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   author                9150 non-null   object
 1   deterministic_author  9150 non-null   object
dtypes: object(2)
memory usage: 214.5+ KB


The original `deterministic` frame had 9,303 rows. This one has 9,150, so filtering on the Greek list removed 153 rows with a deterministic match.

In [38]:
# Make a sorted dataframe of the deterministic values
sorted_deterministic = deterministic.sort_values(axis='index',by='author')
# Save it as a CSV, appending "_2" to the filename to preserve the original
sorted_deterministic.to_csv('../data/deterministic_author_2.csv',index=False,quoting=csv.QUOTE_ALL)

## Fuzzy Authors, Again

In [44]:
# Make a dataframe of the fuzzy matching columns and the author column
fuzzy = latin_only[['author','fuzzy_author','fuzzy_author_score']]
# Eliminate any NaN cells
fuzzy = fuzzy[fuzzy['fuzzy_author'].notna()]

# info() prints its report itself and returns None, so it is not wrapped in print()
fuzzy.info()

# Count the number of unique values in the fuzzy_author column
print(f"Total number of unique values in the fuzzy_author column: {fuzzy['fuzzy_author'].nunique()}")

unmatched_fuzzy = latin_only[latin_only['fuzzy_author'].isna()]
# Show the number of unique author values in the unmatched_fuzzy dataframe
print(f"Number of unique values in the unmatched_fuzzy dataframe: {unmatched_fuzzy['author'].nunique()}")
# Display the list of unmatched author names
sorted(unmatched_fuzzy['author'].unique())

<class 'pandas.core.frame.DataFrame'>
Index: 21938 entries, 0 to 24797
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   author              21938 non-null  object 
 1   fuzzy_author        21938 non-null  object 
 2   fuzzy_author_score  21938 non-null  float64
dtypes: float64(1), object(2)
memory usage: 685.6+ KB
Total number of unique values in the fuzzy_author column: 1135
Number of unique values in the unmatched_fuzzy dataframe: 451


['Academia Molshemensis (Francia)',
 'Achilles Tatius',
 'Achilles Tatius.',
 'Acidalius, Valens, 1567-1595',
 'Aeschines.',
 'Aesop',
 'Aesop.',
 'Agathias, -582',
 'Agathias, -582.',
 'Albertini, Hannibal.',
 'Albumasar.',
 'Alciphron.',
 'Alegambe, Philippo, 1592-1652',
 'Alvisi, Edoardo, 1850-1915.',
 'Amama, Sixtinus, 1593-1629.',
 'Amati, Pasquale,',
 'Anacreon',
 'Anacreon.',
 'Anacreonte.',
 'Andreantonelli, Sebastiano.',
 'Anhalt, Ottocar.',
 'Apollodorus.',
 'Apollonius, Paradoxographus.',
 'Apollonius, Rhodius.',
 'Apollonius, paradoxographus.',
 'Appendix Vergiliana.',
 'Archimedes.',
 'Aristophanes.',
 'Aristófanes.',
 'Arngrímur Jónsson, 1568-1648.',
 'Artemidoro.',
 'Artephius,',
 'Artephius.',
 'Asclepiadeus, Androphilus.',
 'Auerbach, Bertrand, 1856-1942',
 'Augenio, Orazio',
 'Augenio, Orazio.',
 'Autolycus.',
 'Bacchylides.',
 'Bar Hebraeus, 1226-1286.',
 'Barbato, Orazio.',
 'Barbosa, Agostinho, 1590-1649.',
 'Barbosa, Enmanuele.',
 'Barrientos, Bartolomé.',
 'Bar

Hmmm. There are still some Greek names. I guess I need to sift through the `author` column manually?
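
Before sifting by hand, it may be worth noting that several of the leftovers differ from names in the Greek list only by trailing punctuation: `'Archimedes.'`, `'Aristophanes.'`, and `'Apollonius, Rhodius.'` survive the filter, while the same names without the final period were removed. The sketch below normalizes both sides before the `isin()` check. The helper `normalize_name` is hypothetical, and the toy lists reuse a few names from the output above.

```python
import re
import pandas as pd

def normalize_name(name):
    """Lowercase a name and strip trailing punctuation and whitespace."""
    if not isinstance(name, str):
        return ""
    return re.sub(r"[\s.,;:]+$", "", name).strip().lower()

# Toy data; the names are taken from the lists above.
greek_names = ["Apollonius, Rhodius", "Archimedes", "Aristophanes"]
authors = pd.Series([
    "Apollonius, Rhodius.", "Archimedes.", "Aristophanes",
    "Barbosa, Agostinho, 1590-1649.",
])

# Compare normalized forms instead of the raw strings
greek_keys = {normalize_name(n) for n in greek_names}
is_greek = authors.map(normalize_name).isin(greek_keys)

print(authors[~is_greek].tolist())
```

This would not catch vernacular forms such as `'Aristófanes.'` or `'Anacreonte.'`, nor names that are missing from the Greek list altogether, so some manual review would likely still be needed.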