# First Step: Assemble the Latin and Greek Data

From a previous project, I have a file of variant names for authors of Latin texts, each paired with its DLL Author ID (`data/names2.csv`). I also have a file of variant names assembled from the [Virtual International Authority File (VIAF)](https://viaf.org/) (`data/variant-names.csv`). I need to combine the two into one file.

I also have files of names of authors of Greek texts, but I don't have many variant names for them. I'll use Beautiful Soup below to pull down some variant names from VIAF.

The goal is to have a single CSV that records the DLL Author ID (or similar), Authorized Name, and Variant Name for both Greek and Latin authors. In the next step I'll transform the data so that each author is in one row, instead of one row per variant name.
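
To make the two shapes concrete, here is a minimal sketch of the "one row per variant name" layout and of how it could later be collapsed to one row per author. The toy rows reuse names from the data, but the choice of `groupby` with `agg(list)` is only an assumption about how the next step might work.

```python
import pandas as pd

# Long format: one row per variant name (toy rows for illustration).
long_df = pd.DataFrame({
    "DLL ID": ["A6296", "A6296", "A5015"],
    "Authorized Name": [
        "Leo, Friedrich, 1851-1914",
        "Leo, Friedrich, 1851-1914",
        "Abelard, Peter",
    ],
    "Variant Names": ["Leo, Friedrich", "Friedrich Leo", "Abaelard, Peter 1079-1142"],
})

# Collapse to one row per author, gathering the variants into a list.
# sort=False keeps the authors in order of first appearance.
wide_df = (
    long_df.groupby(["DLL ID", "Authorized Name"], sort=False)["Variant Names"]
    .agg(list)
    .reset_index()
)
print(wide_df)
```

Keeping the variants as a list (rather than a joined string) sidesteps the problem that many names contain commas themselves.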

In [3]:
import pandas as pd

# Open the Latin CSV files
dll_variant_names = pd.read_csv('data/names2.csv')
dll_authorized_names = pd.read_csv('data/variant-names.csv') 

In [4]:
print(dll_variant_names.info())
print(dll_authorized_names.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34876 entries, 0 to 34875
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   author         34876 non-null  object
 1   dll_author_id  34876 non-null  object
dtypes: object(2)
memory usage: 545.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3228 entries, 0 to 3227
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   DLL Identifier                  3227 non-null   object 
 1   Authorized Name                 3227 non-null   object 
 2   Short Name                      51 non-null     object 
 3   Author Name English             1252 non-null   object 
 4   Author Name Latin               2197 non-null   object 
 5   Author Name Native Language     1045 non-null   object 
 6   BNE URL                         562 non-null    object 
 7   BNF URL     

Since all the variant names in `dll_authorized_names` are already in `dll_variant_names`, I don't need to keep them here. I do, however, need the "Authorized Name" value.
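
The claim that every variant name is already present can be checked rather than assumed. The sketch below uses toy stand-ins for the two DataFrames; treating `Author Name English` and `Author Name Latin` as the columns that hold variant names is my assumption, and the real file has other columns that might also need checking.

```python
import pandas as pd

# Toy stand-ins for the two frames; the real ones come from the CSV files.
variants = pd.DataFrame({
    "author": ["Leo, Friedrich", "Friedrich Leo", "Abaelard, Peter 1079-1142"],
    "dll_author_id": ["A6296", "A6296", "A5015"],
})
authorities = pd.DataFrame({
    "DLL Identifier": ["A6296", "A5015"],
    "Author Name English": ["Friedrich Leo", None],
    "Author Name Latin": [None, "Petrus Abaelardus"],
})

# Assumption: these are the columns that carry variant names.
name_columns = ["Author Name English", "Author Name Latin"]

# Reshape to one (id, name) pair per row and drop the empty cells.
long_names = authorities.melt(
    id_vars="DLL Identifier", value_vars=name_columns, value_name="name"
).dropna(subset=["name"])

# Every (id, name) pair already recorded in the variants file.
known = set(zip(variants["dll_author_id"], variants["author"]))

# Pairs that appear in the authority file but not in the variants file.
is_missing = [
    pair not in known
    for pair in zip(long_names["DLL Identifier"], long_names["name"])
]
missing = long_names[is_missing]
print(missing)
```

If `missing` comes back empty on the real data, dropping the extra columns loses nothing.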

In [5]:

dll_authorized_names = dll_authorized_names[['DLL Identifier','Authorized Name']]

In [6]:
print(dll_variant_names)

                                           author dll_author_id
0                             A. Cornelius Celsus         A5349
1                               Aage von Dänemark         A4246
2                      Aagesen, Svend, n. c. 1130         A4448
3      Ab Almeloveen, Theodoor Jansson, 1657-1712         A6040
4                       Abaelard, Peter 1079-1142         A5015
...                                           ...           ...
34871                              Leo, Friedkich         A6296
34872                              Leo, Friedrich         A6296
34873            Leo, Friedrich August, 1851-1914         A6296
34874                               Friedrich Leo         A6296
34875                              Lko, Friedrich         A6296

[34876 rows x 2 columns]


In [7]:
print(dll_authorized_names)

     DLL Identifier                                  Authorized Name
0             A5558      Albert, of Aachen, active 11th-12th century
1             A5552                      Marullo Tarcaniota, Michele
2             A5553                             Corvinus, Laurentius
3             A5261                              Claudianus Mamertus
4             A5260                     Sedulius, active 5th century
...             ...                                              ...
3223          A6292  Zangemeister, Karl Friedrich Wilhelm, 1837-1902
3224          A6293                          Rühl, Franz, 1845-1916
3225          A6294                       Kurfess, Alfons, 1889-1965
3226          A6295       Ahlberg, Axel W. (Axel Wilhelm), 1874-1951
3227          A6296                        Leo, Friedrich, 1851-1914

[3228 rows x 2 columns]


In [8]:
# Rename `DLL Identifier` to `dll_author_id` to match the first DataFrame
dll_authorized_names.rename(columns={"DLL Identifier": "dll_author_id"}, inplace=True)

# Merge DataFrames on the common column `dll_author_id`
merged_df = dll_variant_names.merge(dll_authorized_names, on="dll_author_id", how="left")
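
A left merge silently leaves `NaN` in `Authorized Name` for any ID that has no match, and the `info()` output above shows one row of the authority file without a `DLL Identifier`. The following sketch, on toy data, shows one way to surface such problems with the `validate` and `indicator` parameters of `DataFrame.merge`. The ID `A9999` is made up for the example.

```python
import pandas as pd

variants = pd.DataFrame({
    "author": ["Leo, Friedrich", "Friedrich Leo", "Anonymous"],
    # A9999 is a made-up ID with no authority record.
    "dll_author_id": ["A6296", "A6296", "A9999"],
})
authorities = pd.DataFrame({
    "dll_author_id": ["A6296", None],
    "Authorized Name": ["Leo, Friedrich, 1851-1914", None],
})

# Drop authority rows that have no ID to merge on.
authorities = authorities.dropna(subset=["dll_author_id"])

checked = variants.merge(
    authorities,
    on="dll_author_id",
    how="left",
    validate="many_to_one",  # raises MergeError if an ID repeats on the right side
    indicator=True,          # adds a `_merge` column recording where each row matched
)

# Variant rows whose ID found no authorized name.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

On the real data, an empty `unmatched` frame would confirm that every variant name received an authorized name.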

In [9]:
merged_df

| | author | dll_author_id | Authorized Name |
| --- | --- | --- | --- |
| 0 | A. Cornelius Celsus | A5349 | Celsus, Aulus Cornelius |
| 1 | Aage von Dänemark | A4246 | Augustinus, de Dacia, -1285 |
| 2 | Aagesen, Svend, n. c. 1130 | A4448 | Svend Aagesen, approximately 1130- |
| 3 | Ab Almeloveen, Theodoor Jansson, 1657-1712 | A6040 | Almeloveen, Theodoor Jansson ab, 1657-1712 |
| 4 | Abaelard, Peter 1079-1142 | A5015 | Abelard, Peter |
| ... | ... | ... | ... |
| 34871 | Leo, Friedkich | A6296 | Leo, Friedrich, 1851-1914 |
| 34872 | Leo, Friedrich | A6296 | Leo, Friedrich, 1851-1914 |
| 34873 | Leo, Friedrich August, 1851-1914 | A6296 | Leo, Friedrich, 1851-1914 |
| 34874 | Friedrich Leo | A6296 | Leo, Friedrich, 1851-1914 |


In [10]:
new_df = merged_df[['dll_author_id','Authorized Name','author']]

In [11]:
new_df = new_df.rename(columns={"dll_author_id":"DLL ID","author":"Variant Names"})

In [12]:
# Add a column with a 1 for "Latin". I'll add 0 in this column for the Greek authors.
new_df['Latin'] = 1

In [13]:
new_df

| | DLL ID | Authorized Name | Variant Names | Latin |
| --- | --- | --- | --- | --- |
| 0 | A5349 | Celsus, Aulus Cornelius | A. Cornelius Celsus | 1 |
| 1 | A4246 | Augustinus, de Dacia, -1285 | Aage von Dänemark | 1 |
| 2 | A4448 | Svend Aagesen, approximately 1130- | Aagesen, Svend, n. c. 1130 | 1 |
| 3 | A6040 | Almeloveen, Theodoor Jansson ab, 1657-1712 | Ab Almeloveen, Theodoor Jansson, 1657-1712 | 1 |
| 4 | A5015 | Abelard, Peter | Abaelard, Peter 1079-1142 | 1 |
| ... | ... | ... | ... | ... |
| 34871 | A6296 | Leo, Friedrich, 1851-1914 | Leo, Friedkich | 1 |
| 34872 | A6296 | Leo, Friedrich, 1851-1914 | Leo, Friedrich | 1 |
| 34873 | A6296 | Leo, Friedrich, 1851-1914 | Leo, Friedrich August, 1851-1914 | 1 |
| 34874 | A6296 | Leo, Friedrich, 1851-1914 | Friedrich Leo | 1 |


In [19]:
# Read in the Greek names
df2 = pd.read_csv('data/greek.csv')

In [20]:
# Some of the Greek names are duplicates, so I can use the URL column to remove them.
df2_deduped = df2.drop_duplicates(subset='URL')

In [21]:
# Compare the row counts of the original Greek dataframe and the deduped one.
print(len(df2))
print(len(df2_deduped))

6029
2378
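
One caveat about deduplicating on the raw `URL` string: the VIAF URLs in this file are not written uniformly (some use `http`, some add `www.` or a trailing slash), so two spellings of the same record would count as different. A minimal sketch of one workaround is to compare the numeric VIAF ID instead. The helper `viaf_id` is hypothetical, and the second URL below is an invented respelling of the first, added only to show the effect.

```python
import re

def viaf_id(url):
    """Return the run of digits at the end of a VIAF URL, or None if there is none."""
    match = re.search(r"(\d+)/?\s*$", url)
    return match.group(1) if match else None

urls = [
    "http://viaf.org/viaf/27455561",
    "https://viaf.org/viaf/27455561/",      # invented respelling of the URL above
    "https://www.viaf.org/viaf/39429158/",
]
ids = [viaf_id(u) for u in urls]
print(ids)  # ['27455561', '27455561', '39429158']
```

Applied to the DataFrame, this could look like `df2['viaf_id'] = df2['URL'].map(viaf_id)` followed by `drop_duplicates(subset='viaf_id')`. Whether it removes any further rows depends on the data, which I have not verified.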


# Add Variant Names

The `URL` column in `greek.csv`, which I copied from the wonderful folks at the Perseus Project, contains VIAF URLs. I can use Beautiful Soup to fetch each URL and save the variant names that appear on the page.

In [23]:

import requests
from bs4 import BeautifulSoup

In [24]:
def extract_name_entries(url):
    """
    Return the text of every <h2 class="nameEntry"> element within <div id="Title"> at the given URL.

    On failure, the returned list holds a single message describing the problem.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-GB,en;q=0.5'
    }
    
    try:
        # A timeout keeps one unresponsive page from stalling the whole run.
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            # Parse the raw bytes and let Beautiful Soup detect the encoding.
            soup = BeautifulSoup(response.content, 'html.parser')
            title_div = soup.find('div', id='Title')
            if title_div:
                h2_tags = title_div.find_all('h2', class_='nameEntry')
                return [h2.get_text(separator=' ', strip=True) for h2 in h2_tags]
            else:
                return ["No <div id='Title'> found"]
        else:
            return [f"Failed to fetch URL (status code: {response.status_code})"]
    except Exception as e:
        return [f"Error: {str(e)}"]
    
# Work on an explicit copy so the assignments below modify this DataFrame directly
# instead of triggering pandas' SettingWithCopyWarning.
df2_deduped = df2_deduped.copy()
df2_deduped['Variants'] = df2_deduped['URL'].apply(extract_name_entries)

# Join each list in the 'Variants' column into a single string for readability (optional)
df2_deduped['Variants'] = df2_deduped['Variants'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
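
A note on the join step: because many of these names contain commas themselves, a string joined with `', '` cannot be split back into the original names. The sketch below illustrates the problem with two name forms from the table and shows a separator that survives a round trip. The ` | ` separator is only a suggestion, and it assumes no name contains that sequence.

```python
# Hypothetical separator; any string that never occurs inside a name would do.
SEPARATOR = " | "

names = [
    "Pachymère, Georges 1242-1310?",
    "Pachymérès, George, 1242-approximately 1310",
]

# Joining on ', ' and splitting again breaks each name at its internal commas.
comma_split = ", ".join(names).split(", ")

# Joining on the separator and splitting again recovers the original list.
pipe_split = SEPARATOR.join(names).split(SEPARATOR)

print(len(comma_split))     # 5 fragments instead of 2 names
print(pipe_split == names)  # True
```

Another option is to skip the join and keep the lists until the data is written out.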


In [25]:
df2_deduped

| | URL | Name | Variants |
| --- | --- | --- | --- |
| 0 | http://viaf.org/viaf/27455561 | Lycus Rheginus 3. Jh. v. Chr | Lycus Rheginus 3. Jh. v. Chr, Licos de Rhègion... |
| 1 | https://viaf.org/viaf/100905484 | Nicephorus Saint, Patriarch of Constantinople | Nicéphore Ier, 0758?-0828, patriarche de Const... |
| 3 | https://viaf.org/viaf/10236001 | Clitophon Rhodius, 1./2. Jh. v. Chr. | Clitofó, Clitophon Rhodius v1./2. Jh. |
| 4 | https://viaf.org/viaf/312800491 | Demetrius, of Phaleron, b. ca. 350 B.C. | Démétrios de Phalère 0350?-0283? av. J.-C., Δη... |
| 6 | https://viaf.org/viaf/34843722 | Echembrotus Lyricus, 6. Jh. v. Chr | Equembrot, Echembrotus Lyricus 6. Jh. v. Chr, ... |
| ... | ... | ... | ... |
| 6024 | https://viaf.org/viaf/9873664/ | Leo, VI, Emperor of the East 866-912 | Léon VI, 0866-0912, empereur d'Orient, Leo VI.... |
| 6025 | https://viaf.org/viaf/99954624/ | Xenocrates, of Chalcedon, approximately 396 B.... | Xenocrates, of Chalcedon, approximately 396 B.... |
| 6026 | https://www.viaf.org/22528402 | Thaletas, Musicus, um. 665 v. chr. | Taletes, Thaletas Musicus v665 |
| 6027 | https://www.viaf.org/viaf/39429158/ | Pachymérès, George, 1242-approximately 1310 | Pachymère, Georges 1242-1310?, Pachymeres, Geo... |


In [27]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df2_deduped = df2_deduped.assign(Latin=0)


In [28]:
df2_deduped

| | URL | Name | Variants | Latin |
| --- | --- | --- | --- | --- |
| 0 | http://viaf.org/viaf/27455561 | Lycus Rheginus 3. Jh. v. Chr | Lycus Rheginus 3. Jh. v. Chr, Licos de Rhègion... | 0 |
| 1 | https://viaf.org/viaf/100905484 | Nicephorus Saint, Patriarch of Constantinople | Nicéphore Ier, 0758?-0828, patriarche de Const... | 0 |
| 3 | https://viaf.org/viaf/10236001 | Clitophon Rhodius, 1./2. Jh. v. Chr. | Clitofó, Clitophon Rhodius v1./2. Jh. | 0 |
| 4 | https://viaf.org/viaf/312800491 | Demetrius, of Phaleron, b. ca. 350 B.C. | Démétrios de Phalère 0350?-0283? av. J.-C., Δη... | 0 |
| 6 | https://viaf.org/viaf/34843722 | Echembrotus Lyricus, 6. Jh. v. Chr | Equembrot, Echembrotus Lyricus 6. Jh. v. Chr, ... | 0 |
| ... | ... | ... | ... | ... |
| 6024 | https://viaf.org/viaf/9873664/ | Leo, VI, Emperor of the East 866-912 | Léon VI, 0866-0912, empereur d'Orient, Leo VI.... | 0 |
| 6025 | https://viaf.org/viaf/99954624/ | Xenocrates, of Chalcedon, approximately 396 B.... | Xenocrates, of Chalcedon, approximately 396 B.... | 0 |
| 6026 | https://www.viaf.org/22528402 | Thaletas, Musicus, um. 665 v. chr. | Taletes, Thaletas Musicus v665 | 0 |
| 6027 | https://www.viaf.org/viaf/39429158/ | Pachymérès, George, 1242-approximately 1310 | Pachymère, Georges 1242-1310?, Pachymeres, Geo... | 0 |


In [32]:
# Write the two dataframes to CSV files for use in the next step.
df2_deduped.to_csv('fresh/deduped_greek_with_authorized_and_variants.csv', index=False)
new_df.to_csv('fresh/dllid_variant_authorized.csv', index=False)
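
Since the stated goal is a single table covering both languages, here is a rough sketch of how the two DataFrames might eventually be stacked. It rests on several assumptions that the notebook does not confirm: that the Greek `Variants` values are still lists rather than joined strings, that the Greek `Name` can stand in for an authorized name, and that the VIAF URL can serve as the identifier in place of a DLL ID.

```python
import pandas as pd

# Toy stand-ins for the two frames written out above.
latin = pd.DataFrame({
    "DLL ID": ["A6296", "A6296"],
    "Authorized Name": ["Leo, Friedrich, 1851-1914"] * 2,
    "Variant Names": ["Leo, Friedrich", "Friedrich Leo"],
    "Latin": [1, 1],
})
greek = pd.DataFrame({
    "URL": ["https://www.viaf.org/22528402"],
    "Name": ["Thaletas, Musicus, um. 665 v. chr."],
    "Variants": [["Taletes", "Thaletas Musicus v665"]],  # assumed to still be a list
    "Latin": [0],
})

# One row per Greek variant, with columns renamed to match the Latin frame.
greek_long = greek.explode("Variants").rename(
    columns={"URL": "DLL ID", "Name": "Authorized Name", "Variants": "Variant Names"}
)

# Stack the two frames into one long table.
combined = pd.concat([latin, greek_long], ignore_index=True)
print(combined)
```

The `Latin` column then distinguishes the two sources in the combined table.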