# Augmenting the Author Data

I need to make sure that the training set has multiple instances of every label (i.e., every value in `dll_author_id`). First, I should see how many unique labels there are and how many instances there are of each label.



In [32]:
# Import the data from a prepared CSV

import pandas as pd

df = pd.read_csv('output/all_names_deduplicated.csv')
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24708 entries, 0 to 24707
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   author         24708 non-null  object
 1   dll_author_id  24708 non-null  object
dtypes: object(2)
memory usage: 386.2+ KB
None


In [33]:
counts = df['dll_author_id'].value_counts()
print(counts)

dll_author_id
A4644    38
A4379    35
A4322    35
A5377    35
A4830    34
         ..
A6277     1
A3838     1
A3606     1
A5713     1
A6134     1
Name: count, Length: 3137, dtype: int64


There are several labels that have only one instance. How many?

In [34]:
num_single_occurrences = (counts == 1).sum()
print(num_single_occurrences)

148


148 seems like a lot. Since the counts range from a minimum of 1 to a maximum of 38, it would be useful to know the average count.

In [35]:
average = counts.mean()
average

7.876314950589736
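
With a few labels having dozens of instances and many having only one, the mean on its own does not show how many labels are underrepresented. The following is a minimal sketch, not part of the original workflow, that counts the labels below a chosen minimum and the rows needed to bring them up to it. The toy label list, the made-up IDs, and the `threshold` value are all assumptions for illustration, and it uses only the standard library so that it runs without the CSV.

```python
from collections import Counter
from statistics import mean, median

# Toy labels standing in for df['dll_author_id']; these IDs are made up.
labels = ["A1"] * 12 + ["A2"] * 5 + ["A3"] * 2 + ["A4"]

counts = Counter(labels)      # label -> number of instances
sizes = list(counts.values())

threshold = 8                 # hypothetical minimum number of instances per label

# Labels with fewer instances than the threshold, and the shortfall in rows
below = {label: n for label, n in counts.items() if n < threshold}
rows_needed = sum(threshold - n for n in below.values())

print(f"mean={mean(sizes):.2f}, median={median(sizes)}")
print(f"{len(below)} of {len(counts)} labels fall below {threshold}")
print(f"{rows_needed} new rows would bring every label up to {threshold}")
```

On the real `counts` Series, `counts.median()` and `(counts < threshold).sum()` should give the corresponding figures.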

I'm going to augment the data so that every label has at least 8 instances, which is roughly the average count. Each row of the existing dataset already holds a variant form of an author's name, so simply copying the rows of labels with fewer than eight instances would add nothing new. Instead, I'll generate artificial variant spellings by changing a letter or two at random, then use the `textblob` package to nudge the result toward a more realistic spelling.

In [36]:
# Code generated by ChatGPT (https://chatgpt.com/share/e54fc0b4-c8b0-4069-9509-a89487372ccf)

from textblob import Word
import random

def generate_variations(name, num_variations=8, existing_variations=None):
    if existing_variations is None:
        existing_variations = set()
        
    while len(existing_variations) < num_variations:
        variation = list(name)
        print(f"Generating variations for {name}. Existing count: {len(existing_variations)}")
        num_changes = random.randint(1, 2)  # Randomly choose the number of changes
        for _ in range(num_changes):
            idx = random.randint(0, len(variation) - 1)
            if variation[idx].isalpha():
                # Randomly change the letter
                variation[idx] = chr(random.randint(97, 122))
        
        # Create a string from the list
        variation_str = ''.join(variation)
        
        # Use TextBlob to correct the spelling slightly to create realistic variations
        corrected = str(Word(variation_str).spellcheck()[0][0])
        existing_variations.add(corrected)
    
    return list(existing_variations)

# Count existing variations for each author ID
variation_counts = df.groupby('dll_author_id').size().reset_index(name='count')

# Filter for items with fewer than 8 variations
df_to_expand = variation_counts[variation_counts['count'] < 8]

# Collect the new rows to append
new_rows = []

for _, row in df_to_expand.iterrows():
    author_id = row['dll_author_id']
    count = row['count']
    
    # Get all existing variations for this author ID
    existing_variations = set(df[df['dll_author_id'] == author_id]['author'].tolist())
    
    # Calculate how many more variations are needed
    needed_variations = 8 - count
    
    # Generate the required number of variations
    new_variations = generate_variations(list(existing_variations)[0], num_variations=needed_variations, existing_variations=existing_variations)
    
    # Add each new variant to the new_rows list
    for variant in new_variations:
        print(variant)
        new_rows.append({'author': variant, 'dll_author_id': author_id})

# Convert new rows to a DataFrame and append to the original DataFrame
new_rows_df = pd.DataFrame(new_rows)
df_extended = pd.concat([df, new_rows_df], ignore_index=True)

# Display the updated DataFrame
print(df_extended)

Generating variations for Herryson, Joannes floruit=15th Century A.D.. Existing count: 3
Generating variations for Herryson, Joannes floruit=15th Century A.D.. Existing count: 3
Generating variations for Herryson, Joannes floruit=15th Century A.D.. Existing count: 4
Generating variations for Herryson, Joannes floruit=15th Century A.D.. Existing count: 4
Heryyson, Joannes floruit=15th Century A.D.
Herryson, xoannes floruit=15th Century A.D.
John Herryson
Herryson, Joannes floruit=15th Century A.D.
Joannes Herryson
Stratford, Johannes ca. 1275-1348
Johannes Stratford
Stratford, John ca 1275-1348
John Stratford, 1275?-1348
John Stratford
Stratford, John, m. 1348
Generating variations for Nicòmac, de Gerasa, actiu segle I. Existing count: 2
Generating variations for Nicòmac, de Gerasa, actiu segle I. Existing count: 3
Generating variations for Nicòmac, de Gerasa, actiu segle I. Existing count: 3
Generating variations for Nicòmac, de Gerasa, actiu segle I. Existing count: 3
Generating varia

In [37]:
# Check the result
augmented_value_counts = df_extended['dll_author_id'].value_counts()
print(augmented_value_counts)

extended_author = df_extended[df_extended['dll_author_id'] == 'A4536']
print(extended_author)

dll_author_id
A4644    38
A4379    35
A4322    35
A5377    35
A4830    34
         ..
A3225     8
A3754     8
A6293     8
A4181     8
A3749     8
Name: count, Length: 3137, dtype: int64
                                                  author dll_author_id
4226   Burginda Burginda (fl. 7th–early 8th cent.), a...         A4536
4227                                 Burginda ca. um 700         A4536
30389  Burginda Burginda (fl. 7th–early 8th ceni.), a...         A4536
30390  Burginda Burginda (fl. 7th–early 8th cent.), a...         A4536
30391                                Burginda ca. um 700         A4536
30392  Burginda Burgznda (fl. 7th–early 8th cent.), a...         A4536
30393  Burginda Burginda (fl. 7th–eawly 8th cent.), a...         A4536
30394  Burginda Burginda (fl. 7th–early 8th cent.), a...         A4536
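
In the rows for `A4536`, `Burginda ca. um 700` appears both at index 4227 and at index 30391, which suggests that some of the appended rows repeat names already in the dataset. That is consistent with `generate_variations` returning the whole `existing_variations` set, originals included. A minimal sketch of an alternative that returns only strings not already present might look like the following. The helper `make_new_variants` and its parameters (`target`, `seed`, `max_tries`) are hypothetical, it skips the TextBlob spell-check step to stay self-contained, and it uses a seeded generator so that the output is repeatable.

```python
import random
import string

def make_new_variants(names, target=8, seed=0, max_tries=1000):
    """Return new variants so that unique originals plus variants reach `target`.

    Hypothetical helper: it swaps one letter at random per variant and
    omits the TextBlob spell-check step used in the notebook cell.
    """
    rng = random.Random(seed)      # seeded so repeated runs give the same variants
    seen = set(names)              # originals are tracked but never returned
    source = sorted(seen)          # stable order for reproducible choices
    needed = max(0, target - len(seen))
    new_variants = []
    if not source:
        return new_variants

    tries = 0
    while len(new_variants) < needed and tries < max_tries:
        tries += 1
        chars = list(rng.choice(source))
        letter_positions = [i for i, ch in enumerate(chars) if ch.isalpha()]
        if not letter_positions:
            continue               # nothing to alter in this name
        idx = rng.choice(letter_positions)
        replacement = rng.choice(string.ascii_lowercase)
        # Preserve the case of the letter being replaced
        chars[idx] = replacement.upper() if chars[idx].isupper() else replacement
        candidate = "".join(chars)
        if candidate not in seen:  # keep only strings not seen before
            seen.add(candidate)
            new_variants.append(candidate)
    return new_variants

# Two name forms taken from the output printed earlier in this notebook
originals = ["John Stratford", "Johannes Stratford"]
extra = make_new_variants(originals)
print(len(extra))
for name in extra:
    print(name)
```

Because only `new_variants` is returned, appending the result to the DataFrame would not reintroduce the original rows, and a label that already has `target` or more names gets an empty list.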


In [38]:
len(df_extended)

34871

In [39]:
# Save it to a CSV
df_extended.to_csv('output/extended_augmented_all_names_deduplicated.csv', index=False)