# Data Preparation

I'm going to try a hybrid approach: match names against existing data with traditional string-matching techniques first, then fall back to AI/ML to suggest candidates for unknown names or poor matches.
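To make the hybrid idea concrete, here is a minimal sketch of the lookup order: exact match, then fuzzy match, then a flag for later review. It is not the implementation used in this project. The function name `match_author`, the `KNOWN_VARIANTS` dictionary, the `0.85` cutoff, and the use of the standard library's `difflib` are all assumptions for illustration; the AI/ML fallback is represented only by a `"needs_review"` label.

```python
from difflib import get_close_matches

# Hypothetical lookup table: normalized variant name -> author ID.
# The real table would be built from the author data prepared below.
KNOWN_VARIANTS = {
    "joannes herryson": "A1868",
    "john herryson": "A1868",
    "titus calpurnius siculus": "A910",
}


def match_author(name, known=KNOWN_VARIANTS, cutoff=0.85):
    """Return (author_id, method); author_id is None when nothing matches."""
    key = name.strip().lower()
    # 1. Exact string match against known variants
    if key in known:
        return known[key], "exact"
    # 2. Fuzzy match; the cutoff is an assumed threshold, not a tuned value
    close = get_close_matches(key, list(known), n=1, cutoff=cutoff)
    if close:
        return known[close[0]], "fuzzy"
    # 3. No acceptable match: hand the name to the fallback step
    return None, "needs_review"


print(match_author("John Herryson"))
print(match_author("Jon Herryson"))
print(match_author("Marcus Tullius Cicero"))
```

Returning the matching method alongside the ID makes it easy to review fuzzy matches separately from exact ones.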

I need to set up two sets of data:

1. Author data: This needs to include author ID, authorized name, and variant names.
2. Title data: This needs to include work ID, work name, and associated author ID.

## Authors

I have an existing dataset made up of authority records from the DLL's catalog. I augmented it with variant names scraped from the Virtual International Authority File (VIAF) for each author. It also contains some Greek author names, but I'm not going to use those for this project.

In [1]:
import pandas as pd

# Read in the authors data
df = pd.read_csv('../data/authors.csv', encoding='utf-8', quotechar='"')
# Get basic information about the dataframe
# (info() prints its report itself and returns None, so no print() is needed)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37535 entries, 0 to 37534
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   DLL Identifier (Author)  37535 non-null  object
 1   Authorized Name          37535 non-null  object
 2   Language                 37535 non-null  object
 3   Variant                  37522 non-null  object
dtypes: object(4)
memory usage: 1.1+ MB


In [2]:
df.head()

Unnamed: 0,DLL Identifier (Author),Authorized Name,Language,Variant
0,G55561,Lycus Rheginus 3. Jh. v. Chr,Greek,Lycus Rheginus 3. Jh. v. Chr
1,G05484,"Nicephorus Saint, Patriarch of Constantinople",Greek,Nicéphore Ier
2,G36001,"Clitophon Rhodius, 1./2. Jh. v. Chr.",Greek,Clitofó
3,G00491,"Demetrius, of Phaleron, b. ca. 350 B.C.",Greek,Démétrios de Phalère 0350?-0283? av. J.-C.
4,G43722,"Echembrotus Lyricus, 6. Jh. v. Chr",Greek,Equembrot


In [3]:
# Remove the Greek authors
df = df[df['Language'] == 'Latin']
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 35325 entries, 2210 to 37534
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   DLL Identifier (Author)  35325 non-null  object
 1   Authorized Name          35325 non-null  object
 2   Language                 35325 non-null  object
 3   Variant                  35312 non-null  object
dtypes: object(4)
memory usage: 1.3+ MB


Unnamed: 0,DLL Identifier (Author),Authorized Name,Language,Variant
2210,A1868,"Herryson, Joannes",Latin,"Herryson, Joannes floruit=15th Century A.D."
2211,A1868,"Herryson, Joannes",Latin,Joannes Herryson
2212,A1868,"Herryson, Joannes",Latin,John Herryson
2213,A1868,"Herryson, Joannes",Latin,"Heryyson, Joannes floruit=15th Century A.D."
2214,A1868,"Herryson, Joannes",Latin,"Herryson, xoannes floruit=15th Century A.D."


In [4]:
# Remove the Language column
df = df.drop(columns='Language')

In [5]:
# Rearrange the columns
df = df[['Variant','Authorized Name','DLL Identifier (Author)']]

In [6]:
# Check for duplicated Variant names
df['Variant'].duplicated().value_counts()

Variant
False    27291
True      8034
Name: count, dtype: int64

In [7]:
# Keep only the first row for each Variant value
deduplicated = df.drop_duplicates(subset=['Variant'])
deduplicated.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27291 entries, 2210 to 37534
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Variant                  27290 non-null  object
 1   Authorized Name          27291 non-null  object
 2   DLL Identifier (Author)  27291 non-null  object
dtypes: object(3)
memory usage: 852.8+ KB
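Because `drop_duplicates(subset=['Variant'])` keeps only the first row per variant, a variant string attached to two different author IDs would silently lose one of its mappings. Whether that happens in this dataset is not shown above. The sketch below shows one way to check, using invented toy rows (the names and the `X1`-style IDs are placeholders, not catalog data) with the same column names as the dataframe.

```python
import pandas as pd

# Toy rows for illustration only; values are invented placeholders.
toy = pd.DataFrame({
    "Variant": ["shared variant", "shared variant",
                "unique variant", "unique variant"],
    "Authorized Name": ["Name One", "Name Two", "Name Three", "Name Three"],
    "DLL Identifier (Author)": ["X1", "X2", "X3", "X3"],
})

# Count how many distinct author IDs each variant string points to
ids_per_variant = toy.groupby("Variant")["DLL Identifier (Author)"].nunique()

# Variants that map to more than one author are ambiguous
ambiguous = ids_per_variant[ids_per_variant > 1].index.tolist()
print(ambiguous)
```

Running the same `groupby` on the real dataframe before deduplicating would show whether any variants need manual attention.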


In [8]:
# Drop rows with a null Variant (copy so later column assignments don't act on a slice)
authors = deduplicated.dropna(subset=['Variant']).copy()
authors.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27290 entries, 2210 to 37534
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Variant                  27290 non-null  object
 1   Authorized Name          27290 non-null  object
 2   DLL Identifier (Author)  27290 non-null  object
dtypes: object(3)
memory usage: 852.8+ KB


## Works

The works CSV contains data from work records, item records, and web page records in the DLL catalog.

In [9]:
# Read in the data
df = pd.read_csv('../data/works.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5763 entries, 0 to 5762
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Creator                  5334 non-null   object
 1   Title                    5763 non-null   object
 2   Authorized Title         5763 non-null   object
 3   DLL Identifier (Work)    5763 non-null   object
 4   DLL Identifier (Author)  5332 non-null   object
dtypes: object(5)
memory usage: 225.2+ KB


Unnamed: 0,Creator,Title,Authorized Title,DLL Identifier (Work),DLL Identifier (Author)
0,"Gilles, de Corbeil, active 1200",De signis et symptomatibus aegritudinum,De signis et symptomatibus aegritudinum,W10655,A3919
1,"Godis, Petrus de",De coniuratione Porcaria dialogus,De coniuratione Porcaria dialogus,W10654,A3221
2,"William, of Blois",Alda,Alda,W10653,A4844
3,"Suetonius, approximately 69-approximately 122",De Viris Illustribus,De Viris Illustribus,W10652,A4799
4,"Suetonius, approximately 69-approximately 122",De Philosophis,De Philosophis,W10651,A4799


In [10]:
# Keep only the columns we need (drops Creator and Authorized Title)
df = df[['Title','DLL Identifier (Work)','DLL Identifier (Author)']]


In [11]:
# Find duplicated rows
duplicates = df[df.duplicated()]

In [12]:
duplicates

Unnamed: 0,Title,DLL Identifier (Work),DLL Identifier (Author)
1032,Priapeum 'Quid Hoc Novi Est?' (Appendix Vergil...,W4720,
1033,Carmen Arvale,W4629,
1034,Comment. Anquisit. Sergii,W4638,
1035,Carmen Devotionis,W4648,
1036,Carmen Evocationis,W4623,
1037,Commentarii Augurum,W4665,
1038,Grammatica (Anonymi Grammatici),W4654,
1039,Bucolica Einsidlensia,W4912,
1040,Commentarii Consulares,W4626,
4096,De Fide Orthodoxa,W2778,A4376


In [13]:
# Drop the rows that have no value in the author ID column
df_no_na = df.dropna(subset=['DLL Identifier (Author)'])
df_no_na.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5332 entries, 0 to 5762
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Title                    5332 non-null   object
 1   DLL Identifier (Work)    5332 non-null   object
 2   DLL Identifier (Author)  5332 non-null   object
dtypes: object(3)
memory usage: 166.6+ KB


In [14]:
# Find duplicated rows
duplicates = df_no_na[df_no_na.duplicated()]
duplicates

Unnamed: 0,Title,DLL Identifier (Work),DLL Identifier (Author)
4096,De Fide Orthodoxa,W2778,A4376
5334,Sententiae,W257,A4504
5336,De Agri Cultura,W5351,A3400
5339,Commentarii de Bello Civili,W3983,A3607
5341,Antibucolica,W5095,A3521
5352,Etruscarum Rerum Libri,W3928,A3559
5376,"De medicina, versio Latina",W622,A3618
5383,"Auspiciorum Liber, fragmenta",W97,A3582
5405,Astronomica,W3254,A3487
5410,Fabulae,W3233,A3487


In [15]:
# Quick test to see if these are really duplicates
gai = df_no_na[df_no_na['Title'] == "Gai Institutionum epitome"]
gai

Unnamed: 0,Title,DLL Identifier (Work),DLL Identifier (Author)
966,Gai Institutionum epitome,W2332,A3454
5414,Gai Institutionum epitome,W2332,A3454
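By default, `duplicated()` flags only the second and later copies of a row, which is why the first copy has to be looked up separately as in the cell above. A possible shortcut is `keep=False`, which flags every copy at once. The toy rows below are invented placeholders that reuse the column names of the works dataframe.

```python
import pandas as pd

# Toy rows for illustration only; values are invented placeholders.
toy = pd.DataFrame({
    "Title": ["Work A", "Work B", "Work A"],
    "DLL Identifier (Work)": ["W1", "W2", "W1"],
    "DLL Identifier (Author)": ["A1", "A2", "A1"],
})

# Default (keep='first'): only the later copy is flagged
later_copies = toy[toy.duplicated()]

# keep=False: every member of a duplicate group is flagged
all_copies = toy[toy.duplicated(keep=False)].sort_values("Title")

print(later_copies)
print(all_copies)
```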


In [16]:
# Drop the duplicated rows (copy so later column assignments don't act on a slice)
works = df_no_na.drop_duplicates().copy()
df_no_na.info()
works.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5332 entries, 0 to 5762
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Title                    5332 non-null   object
 1   DLL Identifier (Work)    5332 non-null   object
 2   DLL Identifier (Author)  5332 non-null   object
dtypes: object(3)
memory usage: 166.6+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 5315 entries, 0 to 5762
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Title                    5315 non-null   object
 1   DLL Identifier (Work)    5315 non-null   object
 2   DLL Identifier (Author)  5315 non-null   object
dtypes: object(3)
memory usage: 166.1+ KB


In [17]:
works

Unnamed: 0,Title,DLL Identifier (Work),DLL Identifier (Author)
0,De signis et symptomatibus aegritudinum,W10655,A3919
1,De coniuratione Porcaria dialogus,W10654,A3221
2,Alda,W10653,A4844
3,De Viris Illustribus,W10652,A4799
4,De Philosophis,W10651,A4799
...,...,...,...
5758,Arnobii Catholici et Serapionis Conflictus de ...,W2526,A5481
5759,Rerum Gestarum (Res Gestae A fine Corneli Tact...,W3912,A5452
5760,Asinus Aureus,W4047,A5463
5761,Pro se de Magia,W4035,A5463


## Normalize the data

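I'm going to convert all text to lowercase, strip any leading or trailing whitespace, and apply Unicode NFKD normalization so that composed and decomposed forms of the same character compare as equal.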

In [19]:
import unicodedata
# Normalize text
def normalize_text(text):
    # Convert to lowercase
    text = text.lower()
    # Strip leading and trailing whitespace
    text = text.strip()
    # Apply Unicode NFKD normalization (decomposes accented and compatibility characters)
    text = unicodedata.normalize("NFKD", text)
    return text

authors.loc[:,'Variant'] = authors['Variant'].apply(normalize_text)
authors.loc[:,'Authorized Name'] = authors['Authorized Name'].apply(normalize_text)
works.loc[:,'Title'] = works['Title'].apply(normalize_text)
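The effect of NFKD is easy to miss because the text usually looks the same on screen. The sketch below shows that an accented letter becomes a base letter plus a combining mark, so any string compared against this data needs to pass through the same normalization. The `strip_diacritics` helper is hypothetical and not part of this notebook; it only illustrates an optional further step.

```python
import unicodedata

# "Démétrios" written with precomposed characters (U+00E9)
composed = "D\u00e9m\u00e9trios"
decomposed = unicodedata.normalize("NFKD", composed)

# Each accented letter is now a base letter plus a combining accent
print(len(composed), len(decomposed))
print(composed == decomposed)


def strip_diacritics(text):
    """Hypothetical helper: remove combining marks after NFKD decomposition."""
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in nfkd if not unicodedata.combining(ch))


print(strip_diacritics("D\u00e9m\u00e9trios de Phal\u00e8re"))
```

Stripping diacritics would let "Démétrios" match "Demetrios", at the cost of merging names that differ only by accents; the notebook as written keeps the accents.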


In [22]:
# Write the data to CSV
import csv
authors.to_csv('../data/authors_db.csv',index=False,quoting=csv.QUOTE_ALL)
works.to_csv('../data/works_db.csv',index=False,quoting=csv.QUOTE_ALL)
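Many of the names contain commas, so quoting matters when the files are read back. The sketch below round-trips a single row through an in-memory buffer (an assumption made to avoid touching the real files) to show what `csv.QUOTE_ALL` produces and that the comma inside the name survives.

```python
import csv
import io

import pandas as pd

# One row taken from the normalized authors output
toy = pd.DataFrame({
    "Variant": ["herryson, joannes"],
    "DLL Identifier (Author)": ["A1868"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)
text = buffer.getvalue()
print(text)

# Read it back as strings and confirm the values are unchanged
round_trip = pd.read_csv(io.StringIO(text), dtype=str)
print(round_trip.values.tolist() == toy.values.tolist())
```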


In [20]:
authors

Unnamed: 0,Variant,Authorized Name,DLL Identifier (Author)
2210,"herryson, joannes floruit=15th century a.d.","herryson, joannes",A1868
2211,joannes herryson,"herryson, joannes",A1868
2212,john herryson,"herryson, joannes",A1868
2213,"heryyson, joannes floruit=15th century a.d.","herryson, joannes",A1868
2214,"herryson, xoannes floruit=15th century a.d.","herryson, joannes",A1868
...,...,...,...
37530,"calpurnius siculus, titus, činný 1. století","calpurnius siculus, titus",A910
37531,"calpurnius siculus, titus, fl. 60 na chr.","calpurnius siculus, titus",A910
37532,"calpurnius siculus, titus.","calpurnius siculus, titus",A910
37533,titus calpurnius siculus,"calpurnius siculus, titus",A910


In [21]:
works

Unnamed: 0,Title,DLL Identifier (Work),DLL Identifier (Author)
0,de signis et symptomatibus aegritudinum,W10655,A3919
1,de coniuratione porcaria dialogus,W10654,A3221
2,alda,W10653,A4844
3,de viris illustribus,W10652,A4799
4,de philosophis,W10651,A4799
...,...,...,...
5758,arnobii catholici et serapionis conflictus de ...,W2526,A5481
5759,rerum gestarum (res gestae a fine corneli tact...,W3912,A5452
5760,asinus aureus,W4047,A5463
5761,pro se de magia,W4035,A5463
