## Make a Greek Identifier

I need to train a model to correctly identify Greek author names because incoming data for Latin authors often includes Greek authors whose names have been Latinized for the sake of cataloging.

I have gathered a set of Greek names from the [Thesaurus Linguae Graecae](https://stephanus.tlg.uci.edu/) (TLG) project and the [Perseus](https://www.perseus.tufts.edu/hopper/) project. Specifically, the TLG makes lists of authors available at <https://stephanus.tlg.uci.edu/tlgauthors/post_tlg_e.php> and <https://stephanus.tlg.uci.edu/tlgauthors/cd.authors.php>. I also obtained the [Virtual International Authority File](https://viaf.org/) (VIAF) ID numbers for many Greek authors from a database kept for the Perseus Catalog. The script below connects to VIAF using those IDs and scrapes the alternate name forms for those authors.

Unfortunately, the lists of Greek names include some Roman names. For example, I don't know why Juvenal, Cicero, Varro, Macrobius, Servius, and several others are in the Greek lists, but I'll have to remove them before I add the Latin author names from the DLL Catalog.

In general, I omitted any name that is liable to be mistaken for a Roman name (e.g., "Rufus" without any additional identifying information). I also omitted any names of works without an author.
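
The kind of filtering described above can be sketched as follows. This is only an illustration, not necessarily how the cleanup was actually carried out: the `ROMAN_NAMES` list and the `drop_roman_names` helper are hypothetical, and the sample rows are toy data. It assumes the names sit in a `Name` column.

```python
import pandas as pd

# Hypothetical list: the Roman authors mentioned above as examples.
ROMAN_NAMES = ["Juvenal", "Cicero", "Varro", "Macrobius", "Servius"]

def drop_roman_names(frame, names=ROMAN_NAMES, column="Name"):
    """Return a copy of `frame` without rows whose name starts with a listed name."""
    prefixes = tuple(names)
    is_roman = frame[column].map(lambda name: str(name).startswith(prefixes))
    return frame[~is_roman].copy()

# Toy rows for illustration only.
sample = pd.DataFrame(
    {"Name": ["Cicero, Marcus Tullius", "Abydenus", "Varro", "Acacius Sabaita"]}
)
cleaned = drop_roman_names(sample)
print(cleaned["Name"].tolist())
```

Prefix matching is crude (it would also drop a name that merely begins with "Varro"), so a filter like this is best treated as a first pass whose results are reviewed by eye.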

In [69]:
import pandas as pd

df = pd.read_csv('../data/greek_cleanup/greek_cleanup/greek.csv')

In [71]:
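# Some of the Greek names are duplicates, so I can use the URL column to remove them.
# .copy() makes df_deduped an independent frame, so adding a column to it later
# cannot trigger a SettingWithCopyWarning.
df_deduped = df.drop_duplicates(subset='URL').copy()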

In [72]:
df_deduped.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2372 entries, 0 to 6008
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URL     2372 non-null   object
 1   Name    2372 non-null   object
dtypes: object(2)
memory usage: 55.6+ KB
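
Deduplicating on `URL` only catches rows whose URLs are identical character for character. The data shown later in this notebook mixes `http://` and `https://` and includes URLs with and without a trailing slash, so the same VIAF record could in principle appear under several spellings. A minimal sketch of one way to normalize them first; `normalize_viaf_url` is a hypothetical helper, and the three spellings below are made-up variations on one URL from the data:

```python
from urllib.parse import urlsplit

def normalize_viaf_url(url):
    """Reduce a URL to https, a lowercase host, and no trailing slash or padding."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return f"https://{parts.netloc.lower()}{path}"

# Made-up spellings of the same record, for illustration.
urls = [
    "https://viaf.org/viaf/52080722/",
    "http://viaf.org/viaf/52080722/",
    " https://viaf.org/viaf/52080722",
]
print({normalize_viaf_url(u) for u in urls})
```

Applied to the frame, this would look like `df['URL'] = df['URL'].map(normalize_viaf_url)` before calling `drop_duplicates(subset='URL')`.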


In [77]:
sorted_df = df.sort_values(axis=0,by='URL')

In [79]:
sorted_df['URL'] = sorted_df['URL'].str.strip()

In [80]:
import csv
sorted_df.to_csv('../data/greek_cleanup/greek_cleanup/sorted_df_by_url.csv',index=False,quoting=csv.QUOTE_ALL)

In [81]:
df = pd.read_csv('../data/greek_cleanup/greek_cleanup/sorted_df_by_url.csv',quotechar='"',encoding='utf-8')
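
Many of the names contain commas (for example "Abas, Historicus, v2. Jh"), so they have to be quoted to survive a trip through CSV. `QUOTE_ALL` quotes every field rather than only the ones that need it. A small round trip with the standard `csv` module, using an in-memory buffer instead of a file, shows the effect:

```python
import csv
import io

rows = [
    ["URL", "Name"],
    ["https://viaf.org/viaf/49613664/", "Abas, Historicus, v2. Jh"],
]

# Write with every field quoted, as in the to_csv call above.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(rows)
print(buffer.getvalue())

# Read it back: the commas inside the name do not split the field.
buffer.seek(0)
round_tripped = list(csv.reader(buffer, quotechar='"'))
print(round_tripped[1][1])
```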

In [82]:
sorted_df = df.sort_values(axis=0,by='Name')
sorted_df['Name'] = sorted_df['Name'].str.strip()

In [84]:
deduped_sorted_df = sorted_df.drop_duplicates(subset='Name')
print(len(sorted_df))
print(len(deduped_sorted_df))

5958
2354


In [85]:
import requests
from bs4 import BeautifulSoup

In [87]:
def extract_name_entries(url):
    """
    Extracts the text of all <h2 class="nameEntry"> elements within <div id="Title"> from the given URL.
    On failure, returns a one-item list holding a message, so check for those
    messages before treating the results as names.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-GB,en;q=0.5'
    }

    try:
        # A timeout keeps one unresponsive request from stalling the whole run.
        response = requests.get(url, headers=headers, timeout=30)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            title_div = soup.find('div', id='Title')
            if title_div:
                h2_tags = title_div.find_all('h2', class_='nameEntry')
                list_names = [h2.get_text(separator=' ', strip=True) for h2 in h2_tags]
                # Print each name as a simple progress indicator.
                for name in list_names:
                    print(name)
                return list_names
            else:
                return ["No <div id='Title'> found"]
        else:
            return [f"Failed to fetch URL (status code: {response.status_code})"]
    except Exception as e:
        return [f"Error: {str(e)}"]

# This makes one request per URL.
df_deduped['Variants'] = df_deduped['URL'].apply(extract_name_entries)

# Join each list in the 'Variants' column into a single '; '-separated string
df_deduped['Variants'] = df_deduped['Variants'].apply(lambda x: '; '.join(x) if isinstance(x, list) else x)

Lycus Rheginus 3. Jh. v. Chr
Licos de Rhègion
Lycus 3. Jh. v. Chr Rheginus
Lycus Rheginus v3. Jh.
Nicéphore Ier, 0758?-0828, patriarche de Constantinople
Nicephorus Constantinopolitanus
Nicephorus, Constantinopolitanus, Patriarch van Constantinopel, ca. 758-ca. 828
Nicéphore 1 0758?-0828? patriarche de Constantinople
Πατριάρχης Νικηφόρος
Nicefor (święty ; 758-828).
Nicèfor, de Constantinoble, sant, aproximadament 758-829
Nikeforos I, helgon, patriark av Konstantinopel, ca 750-828
Nicefor (św. ; ca 758-828)
Nicephorus 1 helgen, patriark av Konstantinopel
Nicephorus Saint, Patriarch of Constantinople
Nicephorus патриарх Константинопольский ок.758-828
Nikefor, svatý, asi 758-828
Nikifor, Patriarkh konstantinopolʹskiĭ
Nikēphorós 758-829 šventasis Konstantinopolio patriarchas
Nicephorus, Chartophylax
Clitofó
Clitophon Rhodius v1./2. Jh.
Démétrios de Phalère 0350?-0283? av. J.-C.
Δημήτριος ο Φαληρεύς
Demetrius, of Phaleron, approximately 350 B.C.-
Demetrios från Faleron, ca 350-ca 283 f.Kr.
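
Because `extract_name_entries` returns its failure messages in the same one-item-list form as real names, any failed request ends up in the `Variants` column as if it were a name. A minimal sketch of a filter that could be applied to each list before joining; `FAILURE_PREFIXES` and `clean_variants` are hypothetical names, and the prefixes simply mirror the messages in the function above:

```python
# Prefixes of the messages that extract_name_entries returns on failure.
FAILURE_PREFIXES = ("No <div id='Title'> found", "Failed to fetch URL", "Error: ")

def clean_variants(variants):
    """Drop failure messages and empty strings from a list of scraped name forms."""
    return [v for v in variants if v and not v.startswith(FAILURE_PREFIXES)]

print(clean_variants(["Clitofó", "Clitophon Rhodius v1./2. Jh."]))
print(clean_variants(["Failed to fetch URL (status code: 404)"]))
```

Rows left with an empty list can then be retried or dropped, rather than contributing an error message to the training data.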


In [90]:
sorted_df_deduped = df_deduped.sort_values(axis=0,by='Name')
sorted_df_deduped['Name'] = sorted_df_deduped['Name'].str.strip() 

sorted_df_deduped.to_csv('../data/greek_cleanup/viaf_greek_variants.csv',index=False,encoding='utf-8',quoting=csv.QUOTE_ALL)

In [91]:
greek_latin_removed = pd.read_csv('../data/greek_cleanup/viaf_greek_variants.csv',encoding='utf-8',quotechar='"')

In [93]:
sorted_variants = greek_latin_removed.sort_values(axis=0,by='Variants')
sorted_variants['Variants'] = sorted_variants['Variants'].str.strip()
sorted_variants.to_csv('../data/greek_cleanup/sorted_variants.csv',index=False,encoding='utf-8',quoting=csv.QUOTE_ALL)

In [124]:
new_df = pd.read_csv('../data/greek_cleanup/sorted_variants.csv',encoding='utf-8',quotechar='"')
new_df

Unnamed: 0,URL,Name,Variants
0,https://viaf.org/viaf/16544861/,"Iuba II, Mauretania, König, v50-23",2 ዩባ; Juba II 0052? av. J.-C.-0023? roi de Mau...
1,https://viaf.org/viaf/66855775/,Abaris Scythicus v6./5. Jh.,Abaris Scythicus v6./5. Jh.
2,https://viaf.org/viaf/49613664/,"Abas, Historicus, v2. Jh",Abas; Abas Historicus v2. Jh.
3,http://viaf.org/viaf/52080722/,Abascantus Lugdunensis ca. 1.Jh.n.Chr,Abascantus; Abascantus Lugdunensis ca. 1. Jh.
4,https://viaf.org/viaf/22527520/,"Ablabius, Rhetor, 5. Jh. n. Chr",Ablabius; Ablabius Rhetor 5. Jh. n. Chr; Ablab...
...,...,...,...
2205,https://viaf.org/viaf/42631297,"Hippocrates, of Chius",Ἱπποκράτης; Hippocrates Chius v470-v410
2206,https://viaf.org/viaf/65318710/,"Hippodamus, of Miletus","Ἱππόδαμος; Hippodamus, of Miletus; Hippodamus ..."
2207,https://viaf.org/viaf/76706212/,"Hipparchus, Filius, Pisistrati, 6. Jh. v. Chr",Ἵππαρχος; Хипарх; Hipparchus Filius Pisistrati...
2208,https://viaf.org/viaf/49614096,"Hippys, Rheginus, 5./4. Jh. v. Chr.",Ἵππυς; Hippys Rheginus 5./4. Jh. v. Chr; Hippy...


In [125]:
exploded_variants = new_df['Variants'].str.split('; ').explode()

In [126]:
greek = pd.concat([new_df['Name'],exploded_variants])
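
To make the split-and-explode step concrete, here is the same pattern on two rows taken from the table above. Each exploded variant keeps the index of the row it came from, which is why repeated index values show up in the later outputs. Passing `ignore_index=True` to `pd.concat` is an optional tweak, not something the notebook does, that gives every name its own row number:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Abas, Historicus, v2. Jh", "Hippocrates, of Chius"],
    "Variants": [
        "Abas; Abas Historicus v2. Jh.",
        "Ἱπποκράτης; Hippocrates Chius v470-v410",
    ],
})

# One row per variant; the original row index is repeated.
exploded = toy["Variants"].str.split("; ").explode()
print(exploded.index.tolist())

# Stack the primary names on top of the variants with a fresh index.
combined = pd.concat([toy["Name"], exploded], ignore_index=True)
print(combined.tolist())
```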

In [127]:
tlg = pd.read_csv('../data/greek_cleanup/greek-authors.csv')

In [128]:
tlg

Unnamed: 0,Name,Label
0,Abramius,Greek
1,Abydenus,Greek
2,Acacius,Greek
3,Acacius Sabaita,Greek
4,Acesander,Greek
...,...,...
25735,Tucídides ca. 460-ca. 400 a. C,Greek
25736,Tucídides ca. 460-ca. 400 a.,Greek
25737,Xenophon,Greek
25738,Yamblico,Greek


In [129]:
greek_frame = greek.to_frame()

In [130]:
greek_frame['Label'] = "Greek"

In [131]:
greek_frame = greek_frame.rename(columns={0:'Name'})

In [132]:
greek_frame

Unnamed: 0,Name,Label
0,"Iuba II, Mauretania, König, v50-23",Greek
1,Abaris Scythicus v6./5. Jh.,Greek
2,"Abas, Historicus, v2. Jh",Greek
3,Abascantus Lugdunensis ca. 1.Jh.n.Chr,Greek
4,"Ablabius, Rhetor, 5. Jh. n. Chr",Greek
...,...,...
2208,Hippys 5./4. Jh. v. Chr Rheginus,Greek
2208,Hippys Rheginus v5./4. Jh.,Greek
2208,"Hippys, Rheginus",Greek
2209,卢波库斯,Greek


In [133]:
greek_authors = pd.concat([tlg,greek_frame])

In [134]:
greek_authors_deduped = greek_authors.drop_duplicates(subset='Name')

In [135]:
greek_authors_deduped

Unnamed: 0,Name,Label
0,Abramius,Greek
1,Abydenus,Greek
2,Acacius,Greek
3,Acacius Sabaita,Greek
4,Acesander,Greek
...,...,...
2208,Hippys 5./4. Jh. v. Chr Rheginus,Greek
2208,Hippys Rheginus v5./4. Jh.,Greek
2208,"Hippys, Rheginus",Greek
2209,卢波库斯,Greek
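
`drop_duplicates` compares strings exactly, so two names that look identical on screen but are encoded differently (a precomposed accented letter versus a base letter plus a combining accent) count as distinct. Whether the scraped data actually mixes these forms is an assumption worth checking. A minimal sketch of normalizing names to NFC first, using the standard `unicodedata` module; `normalize_name` is a hypothetical helper:

```python
import unicodedata

def normalize_name(name):
    """Strip surrounding whitespace and normalize to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", name.strip())

decomposed = "Nice\u0301phore"  # "e" followed by a combining acute accent
composed = "Nic\u00e9phore"     # precomposed "é"

print(decomposed == composed)
print(normalize_name(decomposed) == normalize_name(composed))

# The same applies to polytonic Greek: capital iota plus a combining rough breathing.
greek_decomposed = "\u0399\u0314ππυς"
greek_composed = "\u1f39ππυς"
print(normalize_name(greek_decomposed) == greek_composed)
```

On the frame this could be applied as `greek_authors['Name'].map(normalize_name)` before `drop_duplicates(subset='Name')`.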


In [136]:
# Read in the Latin authors
latin = pd.read_csv('../data/greek_cleanup/authors.csv')

In [None]:
latin = latin[latin['Language'] == 'Latin']
latin

Unnamed: 0,DLL Identifier (Author),Authorized Name,Language,Variant
2210,A1868,"Herryson, Joannes",Latin,"Herryson, Joannes floruit=15th Century A.D."
2211,A1868,"Herryson, Joannes",Latin,Joannes Herryson
2212,A1868,"Herryson, Joannes",Latin,John Herryson
2213,A1868,"Herryson, Joannes",Latin,"Heryyson, Joannes floruit=15th Century A.D."
2214,A1868,"Herryson, Joannes",Latin,"Herryson, xoannes floruit=15th Century A.D."
...,...,...,...,...
37530,A910,"Calpurnius Siculus, Titus",Latin,"Calpurnius Siculus, Titus, činný 1. století"
37531,A910,"Calpurnius Siculus, Titus",Latin,"Calpurnius Siculus, Titus, fl. 60 na Chr."
37532,A910,"Calpurnius Siculus, Titus",Latin,"Calpurnius Siculus, Titus."
37533,A910,"Calpurnius Siculus, Titus",Latin,Titus Calpurnius Siculus


In [140]:
sorted_latin = latin.sort_values(axis=0,by='Authorized Name')

In [141]:
sorted_latin.to_csv('../data/greek_cleanup/latin_authors.csv',index=False,quoting=csv.QUOTE_ALL)

In [142]:
greek_list = greek_authors_deduped['Name'].to_list()
latin_not_greek = sorted_latin[~sorted_latin['Authorized Name'].isin(greek_list)]

In [143]:
len(latin_not_greek)

34163
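
Note that this filter is an exact string match on `Authorized Name` only: a Latin-list row is dropped only if its authorized name is spelled exactly like an entry in the Greek list. The sketch below contrasts that with a stricter option that also checks the `Variant` column. The rows are toy data (the second entry is invented for illustration), and the stricter filter is a possible extension rather than what the notebook does:

```python
import pandas as pd

greek_names = ["Abydenus", "Acacius Sabaita"]

# Toy rows; the "Acacius (hypothetical entry)" row is invented for illustration.
latin_toy = pd.DataFrame({
    "Authorized Name": ["Abydenus", "Acacius (hypothetical entry)", "Abelard, Peter"],
    "Variant": ["Abydenus", "Acacius Sabaita", "Pietro Abelardo, 1079-1142"],
})

# Exact match on the authorized name only, as in the cell above.
by_authorized = latin_toy[~latin_toy["Authorized Name"].isin(greek_names)]

# Stricter: drop the row if either the authorized name or the variant is a Greek name.
in_greek = (
    latin_toy["Authorized Name"].isin(greek_names)
    | latin_toy["Variant"].isin(greek_names)
)
by_either = latin_toy[~in_greek].copy()

print(len(by_authorized), len(by_either))
```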

In [146]:
latin_not_greek.columns

Index(['DLL Identifier (Author)', 'Authorized Name', 'Language', 'Variant'], dtype='object')

In [145]:
latin_not_greek.to_csv('../data/greek_cleanup/latin_authors.csv',index=False,quoting=csv.QUOTE_ALL)

In [149]:
latin_not_greek

Unnamed: 0,DLL Identifier (Author),Authorized Name,Language,Variant
3984,A3060,Abbo Metensis ca. 627 - 643,Latin,Abbo Metensis ca. ca. 627 - 643
3987,A3060,Abbo Metensis ca. 627 - 643,Latin,Goerik Metský biskup a světec
3986,A3060,Abbo Metensis ca. 627 - 643,Latin,Abbo Metensis ca. 627 - 643
3985,A3060,Abbo Metensis ca. 627 - 643,Latin,Goerik Metský biskup a světec
3983,A3060,Abbo Metensis ca. 627 - 643,Latin,Abbo Metensis ca. 627 – 643
...,...,...,...,...
18846,A4487,"Ælnoth, monk of St. Augustine, Canterbury, act...",Latin,Aelnoth (10..-11..).
18845,A4487,"Ælnoth, monk of St. Augustine, Canterbury, act...",Latin,Aelnothus Cantuariensis ca. 12. Jh.
18844,A4487,"Ælnoth, monk of St. Augustine, Canterbury, act...",Latin,Aelnothus Cantuariensis ca. 12. Jh.
18842,A4487,"Ælnoth, monk of St. Augustine, Canterbury, act...",Latin,Aelnothus


In [152]:
# Group by 'Authorized Name' and join each author's 'Variant' values with semicolons.
# Casting to str inside assign(), rather than assigning back to the filtered frame,
# avoids pandas' SettingWithCopyWarning.
collapsed_df = (
    latin_not_greek.assign(Variant=latin_not_greek['Variant'].astype("str"))
    .groupby('Authorized Name')['Variant']
    .agg('; '.join)
    .reset_index()
)


In [153]:
collapsed_df.to_csv('../data/greek_cleanup/latin_authors.csv',index=False,quoting=csv.QUOTE_ALL)

In [154]:
latin = pd.read_csv('../data/greek_cleanup/latin_authors.csv',encoding='utf-8',quotechar='"')

In [155]:
latin

Unnamed: 0,Authorized Name,Variant
0,Abbo Metensis ca. 627 - 643,Abbo Metensis ca. ca. 627 - 643; Goerik Metský...
1,"Abbo, Monk of St. Germain, approximately 850-a...",Abbo Sangermanus apie 850-apie 923; Abbon ca 8...
2,"Abbo, of Fleury, Saint","Abbo, Floriacensis, ca. 945-1004; Abbo Floriac..."
3,"Abelard, Peter","Pietro Abelardo, 1079-1142; Abaelard, Peter 10..."
4,"Abercius, Saint, Bishop of Hierapolis, -approx...","Abercius Saint, Bishop of Hierapolis; Abercius..."
...,...,...
3031,"Zumpt, A. W. (August Wilhelm), 1815-1877",August Wilhelm Zumpt deutscher Epigraphiker un...
3032,"Zwingli, Ulrich, 1484-1531","Zwingli, U. 1484-1531 Ulrich; Zwingli, Huldryc..."
3033,"Zycha, Joseph","Iosephus Zcyha; Zycha, Josephus; Zycha, Joseph..."
3034,"Zylvelt, Anthony van, 1640-1695","Zylvelt, Anthony van (około 1640-1695).; Zylve..."
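
The group-and-join step that produced this table keeps every variant, including repeats (the same variant can be listed more than once for one author). A minimal sketch of a join that drops repeats within each author while keeping first-seen order; `join_unique` is a hypothetical helper, and the rows are a small sample in the shape of the data above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Authorized Name": ["Abbo Metensis ca. 627 - 643"] * 3 + ["Zycha, Joseph"],
    "Variant": [
        "Goerik Metský biskup a světec",
        "Abbo Metensis ca. 627 - 643",
        "Goerik Metský biskup a světec",
        "Iosephus Zcyha",
    ],
})

def join_unique(values):
    """Join values with '; ' after dropping repeats, keeping first-seen order."""
    return "; ".join(dict.fromkeys(values))

collapsed = toy.groupby("Authorized Name")["Variant"].agg(join_unique).reset_index()
print(collapsed["Variant"].tolist())
```

Since the combined data set is deduplicated on `Name` at the end anyway, this mainly keeps the intermediate CSV smaller and easier to read.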


In [156]:
exploded = latin['Variant'].str.split('; ').explode()

In [157]:
exploded

0           Abbo Metensis ca. ca. 627 - 643
0             Goerik Metský biskup a světec
0               Abbo Metensis ca. 627 - 643
0             Goerik Metský biskup a světec
0               Abbo Metensis ca. 627 – 643
                       ...                 
3035                   Aelnoth (10..-11..).
3035    Aelnothus Cantuariensis ca. 12. Jh.
3035    Aelnothus Cantuariensis ca. 12. Jh.
3035                              Aelnothus
3035                   Aelnoth (10..-11..).
Name: Variant, Length: 33992, dtype: object

In [177]:
combined_latin = pd.concat([latin['Authorized Name'],exploded])

In [178]:
combined_latin_frame = combined_latin.to_frame()

In [179]:
combined_latin_frame = combined_latin_frame.rename(columns={0:'Name'})
combined_latin_frame['Label'] = "Latin"

In [180]:
combined_latin_frame

Unnamed: 0,Name,Label
0,Abbo Metensis ca. 627 - 643,Latin
1,"Abbo, Monk of St. Germain, approximately 850-a...",Latin
2,"Abbo, of Fleury, Saint",Latin
3,"Abelard, Peter",Latin
4,"Abercius, Saint, Bishop of Hierapolis, -approx...",Latin
...,...,...
3035,Aelnoth (10..-11..).,Latin
3035,Aelnothus Cantuariensis ca. 12. Jh.,Latin
3035,Aelnothus Cantuariensis ca. 12. Jh.,Latin
3035,Aelnothus,Latin


In [192]:
greek_and_latin = pd.concat([greek_authors_deduped,combined_latin_frame])

In [193]:
greek_and_latin

Unnamed: 0,Name,Label
0,Abramius,Greek
1,Abydenus,Greek
2,Acacius,Greek
3,Acacius Sabaita,Greek
4,Acesander,Greek
...,...,...
3035,Aelnoth (10..-11..).,Latin
3035,Aelnothus Cantuariensis ca. 12. Jh.,Latin
3035,Aelnothus Cantuariensis ca. 12. Jh.,Latin
3035,Aelnothus,Latin


In [194]:
greek_and_latin['Label'].value_counts()

Label
Latin    37028
Greek    13120
Name: count, dtype: int64

In [195]:
deduped_greek_and_latin = greek_and_latin.drop_duplicates(subset='Name')


In [196]:
deduped_greek_and_latin['Label'].value_counts()

Label
Latin    27323
Greek    13120
Name: count, dtype: int64
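
The Greek count is unchanged because the Greek rows were already deduplicated and come first in the concatenation: `drop_duplicates` keeps the first occurrence by default, so any name that appears under both labels keeps its Greek label. Much of the drop in the Latin count is probably repeated Latin variants, but some of it may be such overlaps. A minimal sketch for listing names that carry both labels before deduplicating, on toy data (the Latin "Acacius" row is invented for illustration):

```python
import pandas as pd

# Toy rows; the Latin "Acacius" entry is invented for illustration.
toy = pd.DataFrame({
    "Name": ["Abydenus", "Acacius", "Acacius", "Aelnothus"],
    "Label": ["Greek", "Greek", "Latin", "Latin"],
})

# Names that appear under more than one label.
labels_per_name = toy.groupby("Name")["Label"].nunique()
conflicts = labels_per_name[labels_per_name > 1].index.tolist()
print(conflicts)

# With the default keep='first', the Greek row is the one that survives.
deduped = toy.drop_duplicates(subset="Name")
print(deduped.loc[deduped["Name"] == "Acacius", "Label"].tolist())
```

Reviewing the conflict list by hand before training would show whether the Greek-wins default is the right call for each name.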

In [197]:
deduped_greek_and_latin.to_csv('../data/deduped_greek_and_latin.csv',index=False,encoding='utf-8',quoting=csv.QUOTE_ALL)