# Spelling Orthography Update

This project uses machine learning to update text from an older spelling orthography to a modern one. It focuses on updating the orthography of the 1908 Chamorro Bible.

**Name:** Schyuler Lujan<br>
**Date Started:** 6-Nov-2024<br>
**Date Completed:** In Progress<br>
**Last Updated:** 7-Nov-2024

In [36]:
# Import libraries
from bs4 import BeautifulSoup # For web scraping
import requests # For web scraping
import re # For text cleaning
import pandas as pd # For analysis
import matplotlib.pyplot as plt # For analysis

# Scrape Text Data

Scrape the text data from the chamorrobible.org website and format the text into a dataset of unique words.

In [37]:
# All text can be found at this URL
website = 'http://chamorrobible.org/download/YSantaBiblia-Chamorro-HTML.htm'

In [38]:
# Download the page and stop early if the request failed
page = requests.get(website)
page.raise_for_status()
soup = BeautifulSoup(page.content, "html.parser")

In [39]:
# Get all the text
ch_bible_text = soup.get_text()

In [40]:
# Check the text
#print(ch_bible_text)

# Clean Text Data

In [41]:
# Remove numbers
text_clean = re.sub(r"\d+", " ", ch_bible_text)

In [42]:
# Remove punctuation
text_clean = re.sub(r"[^\w\s]", "", text_clean)

In [43]:
# Standardize text by converting to lowercase
text_clean = text_clean.lower()

In [44]:
# Split text into words and store them in a list that keeps duplicates
total_word_list = text_clean.split()
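
The cleaning steps above can also be collected into a single function, which makes them easier to check on a small input. The following is a minimal sketch: the name `clean_text` and the sample string are assumptions for illustration (the string is assembled from words in this notebook, not quoted from the Bible text).

```python
import re

def clean_text(raw_text: str) -> list:
    """Apply the same cleaning steps as the cells above and return a word list."""
    text = re.sub(r"\d+", " ", raw_text)   # Replace runs of digits with a space
    text = re.sub(r"[^\w\s]", "", text)    # Drop everything except word characters and whitespace
    text = text.lower()                     # Standardize case
    return text.split()                     # Split on whitespace, keeping duplicates

# Illustrative input only, not a real verse
sample = "1 Ya ilegña: Güiya, y taotao!"
sample_words = clean_text(sample)
print(sample_words)
```

In Python 3, `\w` matches Unicode letters by default for `str` patterns, so characters such as `ñ` and `ü` survive the punctuation step.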

# Exploratory Analysis

## Basic Descriptive Statistics

### Word Counts

In [45]:
# Get total word count
total_word_count = len(total_word_list)
print(f"The total word count: {total_word_count:,}")

The total word count: 132,875


In [46]:
# Get unique word count
unique_word_set = set(total_word_list)
unique_word_count = len(unique_word_set)
print(f"The unique word count: {unique_word_count:,}")

The unique word count: 11,193


### Word Lengths

In [47]:
# Record the length of every word, used to compute the average below
word_length = [] # Initialize list

for word in total_word_list:
    word_length.append(len(word))

In [48]:
average_word_length = sum(word_length) / total_word_count
print(f"Average word length: {average_word_length} characters")

Average word length: 4.725388523047977 characters


In [49]:
# Get maximum and minimum word lengths
word_length.sort(reverse=True) # Sort in descending order
max_word_length = word_length[0]
min_word_length = word_length[-1]
print(f"Longest word: {max_word_length} characters")
print(f"Shortest word: {min_word_length} character")

Longest word: 22 characters
Shortest word: 1 character


### Character Counts

In [50]:
characters = {} # Initialize dictionary for storing characters
total_character_count = 0 # For holding the total character count

# Character counts
for word in total_word_list:
    for char in word:
        total_character_count += 1
        if char in characters:
            characters[char] += 1
        else:
            characters[char] = 1

In [51]:
print(f"The total number of characters in the text: {total_character_count:,}")

The total number of characters in the text: 627,886
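
As an aside, the same tally can be produced with `collections.Counter` from the standard library. This is a sketch under stated assumptions: `example_words` is a hypothetical stand-in for `total_word_list` so that the block runs on its own.

```python
from collections import Counter

# Hypothetical stand-in for total_word_list
example_words = ["ya", "y", "taotao", "ya"]

# Count every character across all words in one pass
example_characters = Counter(char for word in example_words for char in word)

# The total character count is the sum of the individual counts
example_total = sum(example_characters.values())

print(example_characters)
print(f"Total characters: {example_total:,}")
```

The same pattern, `Counter(example_words)`, would count whole words instead of characters.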


## Frequency Analysis

### Character Frequencies

In [52]:
# Convert dictionary to a list of tuples before converting to dataframe
character_list = [] # Initialize list
for char in characters:
    character_list.append((char, characters[char]))

In [53]:
# Convert to dataframe and view results in descending order
character_frequency_df = pd.DataFrame(character_list, columns=["Character", "Frequency"])
# Sort dataframe by frequency
character_frequency_df.sort_values(by="Frequency", ascending=False, inplace=True)
print(character_frequency_df)

   Character  Frequency
2          a     130786
3          n      60208
9          o      47056
6          i      46099
14         e      39671
0          y      38423
12         u      34416
15         g      34305
1          s      30192
4          t      29560
10         j      28684
8          m      27442
7          l      18041
17         p       9827
11         c       9349
13         r       7984
19         d       7472
18         ñ       7411
20         f       6352
5          b       4211
23         ü       4050
16         h       3452
25         q       1023
22         v        870
27         â        646
24         á        160
26         é         85
28         ó         51
21         ú         23
29         í         18
33         x         12
30         ô          2
31         z          2
32         k          2
34         ã          1


### Word Frequencies

In [54]:
words = {} # Initialize dictionary to store word counts

# Iterate through word list and count each word
for word in total_word_list:
    if word in words:
        words[word] += 1
    else:
        words[word] = 1

In [55]:
word_frequencies = [] # Initialize list
for word in words:
    word_frequencies.append((word, words[word]))

In [56]:
# Convert to a dataframe
word_frequencies_df = pd.DataFrame(word_frequencies, columns=["Word", "Frequency"])
word_frequencies_df.sort_values(by="Frequency", ascending=False, inplace=True)
# View top 50 words
print(word_frequencies_df.head(50))

          Word  Frequency
0            y      15477
57          ya       7730
83          na       4631
10         gui       4123
4         sija       3631
9          yan       2958
40          ni       2813
15          si       2354
181       para       1629
41          ti       1487
81          sa       1450
171      güiya       1242
39      taotao       1240
216       anae       1117
50         ayo       1087
129     ilegña       1077
130         nu       1061
52         lao       1020
53       guiya       1013
73        todo        965
153      jamyo        959
183       yuus        848
391       este        824
239        güe        818
137        pot        811
55       jeova        749
5371     jesus        681
119      guajo        668
114        nae        662
17          un        654
185       jago        599
182         as        522
90        jafa        510
97        tano        476
150          o        424
206         yo        411
531      locue        402
352       es

### Lexical Diversity

Compare the number of words used more than once with the number of words used only once in the entire text, to understand how diverse the vocabulary is relative to the amount of repetition.

In [57]:
# Create a dataframe of non-repeated words
not_repeated_df = word_frequencies_df[word_frequencies_df["Frequency"] == 1]

# Count the number of words occurring only once
total_not_repeated = len(not_repeated_df)
print(f"Total words occurring once: {total_not_repeated:,}")
print(f"Words occurring once are {(total_not_repeated / unique_word_count)} of the unique words")

Total words occurring once: 6,207
Words occurring once are 0.5545430179576522 of the unique words


In [58]:
# Create a dataframe of repeated words
repeated_words_df = word_frequencies_df[word_frequencies_df["Frequency"] > 1]
# Count the number of repeated words
total_repeated = len(repeated_words_df)
print(f"Total words repeated: {total_repeated:,}")
print(f"Repeated words are {total_repeated / unique_word_count} of the unique words")

Total words repeated: 4,986
Repeated words are 0.4454569820423479 of the unique words


In [71]:
# Get descriptive statistics for Frequency on repeated words
repeated_words_df["Frequency"].describe()

count     4986.000000
mean        25.404733
std        283.286989
min          2.000000
25%          2.000000
50%          4.000000
75%          8.000000
max      15477.000000
Name: Frequency, dtype: float64

# Export Data to Create Training Set

Currently, no labelled dataset is available to train the models, so I will need to create a dataset of sample pairs manually. The words will be drawn from the unique word set, and each chosen word will be mapped to its equivalent in the new orthography.<p>
    
**Size of Sample** <br>
    
I will start with 10% of the total unique words in the text. More data generally helps, but because I must create the labelled training set by hand, starting with 10% keeps unnecessary manual work to a minimum.<p>
    
**Creating a Representative Sample**<br>

To ensure the training set is representative, I will create a stratified random sample by word frequency. **(I will include more notes from the lexical diversity analysis.)** The aim is to avoid overfitting to the highest-frequency words and still capture rare words, while ensuring that the model performs well on high-frequency words. To build the sample, I will divide the unique word dataset into groups based on word frequency, then randomly sample from each group. <p>
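
The stratified sampling described above could be sketched as follows, using only the standard library. Everything here is an assumption for illustration: the function name `stratified_sample`, the band limits of 1 and 8 (loosely based on the quartiles shown earlier), the 50% fraction, and the toy counts, which stand in for the `words` dictionary.

```python
import random

def stratified_sample(word_counts, band_upper_bounds, fraction, seed=0):
    """Sample roughly the same fraction of words from each frequency band.

    band_upper_bounds is an ascending list of inclusive upper limits;
    words above the last limit fall into one final band.
    """
    rng = random.Random(seed)  # Fixed seed so the sample is reproducible
    bands = [[] for _ in range(len(band_upper_bounds) + 1)]
    for word, count in sorted(word_counts.items()):
        # Band index = number of limits that this word's count exceeds
        index = sum(count > bound for bound in band_upper_bounds)
        bands[index].append(word)
    sample = []
    for band in bands:
        if band:
            # Take at least one word from every non-empty band
            n = max(1, round(len(band) * fraction))
            sample.extend(rng.sample(band, n))
    return sample

# Hypothetical toy counts: 10 words seen once, 6 seen 2-8 times, 4 seen more often
toy_frequencies = [1] * 10 + [2, 3, 4, 5, 8, 8] + [50, 100, 1000, 15000]
toy_counts = {f"w{i}": c for i, c in enumerate(toy_frequencies)}

toy_sample = stratified_sample(toy_counts, [1, 8], fraction=0.5)
print(len(toy_sample), "words sampled")
```

Sampling a fixed fraction per band keeps the sample's frequency profile close to that of the full word list. Sampling a fixed number per band instead would deliberately over-represent the small high-frequency band.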
    
**Considerations for Poor Model Performance**
    
The models may perform poorly on this manually created training set. Because enlarging the training set is so costly, I will increase the sample size only incrementally. One approach is to identify the words the model is least confident about, look for common patterns in them (e.g., specific affixes), and add more similarly affixed words to the sample set to improve performance.
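
One rough way to look for shared patterns among low-confidence words is to tally their leading and trailing characters. The sketch below is hypothetical throughout: no model has been trained yet, so the confidence scores are invented placeholders, and `common_affixes` and its parameters are illustrative names. Fixed-length character slices are only a crude proxy for real affixes.

```python
from collections import Counter

def common_affixes(word_confidences, n_lowest=3, affix_length=2):
    """Count leading and trailing character sequences among the lowest-confidence words."""
    # Sort words from least to most confident and keep the first n_lowest
    lowest = sorted(word_confidences, key=word_confidences.get)[:n_lowest]
    # Skip words too short to have a separate prefix or suffix of this length
    candidates = [w for w in lowest if len(w) > affix_length]
    prefixes = Counter(w[:affix_length] for w in candidates)
    suffixes = Counter(w[-affix_length:] for w in candidates)
    return lowest, prefixes, suffixes

# Placeholder scores for illustration only, not model output
toy_confidences = {
    "jamyo": 0.31, "jago": 0.35, "jafa": 0.40,
    "guiya": 0.88, "ilegña": 0.92, "taotao": 0.95,
}

lowest, prefixes, suffixes = common_affixes(toy_confidences)
print(lowest)
print(prefixes.most_common(1))
```

With these placeholder scores, all three lowest-confidence words begin with `ja`, which would suggest adding more `ja`-initial words to the next batch of labelled pairs.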

In [None]:
# Export word frequency dataframe to CSV file, for additional reference
#word_frequencies_df.to_csv('chamorro_bible_words.csv', index=False, encoding="utf-8")

In [None]:
# Determine the number of words needed for 10% of unique words
sample_size = int(.1 * len(unique_word_set))
print(f"Sample size: {sample_size:,}")

# Train Models

# Evaluate Model Performance

# Final Model Selection

# Export Final Dataset

# Conclusions

# Opportunities for Future Analysis