# Spelling Orthography Update

This project uses machine learning to update text from an older spelling orthography to a modern one. The focus of this project is updating the spelling orthography of the 1908 Chamorro Bible.

**Name:** Schyuler Lujan<br>
**Date Started:** 6-Nov-2024<br>
**Date Completed:** In Progress

In [113]:
# Import libraries
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import nltk
from nltk.util import ngrams
from collections import Counter

# Scrape Text Data

Scrape the text from the chamorrobible.org website and format it into a dataset of unique words.

In [2]:
# All text can be found at this URL
website = 'http://chamorrobible.org/download/YSantaBiblia-Chamorro-HTML.htm'

In [3]:
# Download the page and stop early if the request failed
page = requests.get(website)
page.raise_for_status()
soup = BeautifulSoup(page.content, "html.parser")

In [20]:
# Get all the text
ch_bible_text = soup.get_text()

In [1]:
# Check the text
#print(ch_bible_text)

# Clean Text Data

In [110]:
# Remove numbers
text_clean = re.sub(r"\d+", " ", ch_bible_text)

In [69]:
# Remove punctuation
text_clean = re.sub(r"[^\w\s]", "", text_clean)
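One thing worth checking at this step: in Python, `\w` does not match combining marks, so if the scraped page stores a letter such as `ü` in decomposed form (`u` followed by U+0308), the punctuation regex silently strips the diacritic. Whether the source page actually contains decomposed characters is not known here. The sketch below only demonstrates the behaviour and one possible guard using `unicodedata.normalize`; the helper name `remove_punctuation` is hypothetical.

```python
import re
import unicodedata

def remove_punctuation(text):
    """Normalize to NFC, then drop everything that is not a word character or whitespace."""
    composed = unicodedata.normalize("NFC", text)
    return re.sub(r"[^\w\s]", "", composed)

# 'u' followed by a combining diaeresis (decomposed form of 'ü')
decomposed = "gu\u0308iya"

stripped_without_nfc = re.sub(r"[^\w\s]", "", decomposed)
stripped_with_nfc = remove_punctuation(decomposed)

print(stripped_without_nfc == "guiya")      # True: the diacritic was lost
print(stripped_with_nfc == "g\u00fciya")    # True: the diacritic was kept
```

If the text turns out to be fully composed already, the normalization call is a no-op and the cleaning result is unchanged.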

In [70]:
# Standardize text by converting to lowercase
text_clean = text_clean.lower()

In [71]:
# Split text by word and store in a list of all words (duplicates included)
total_word_list = text_clean.split()

# Exploratory Analysis

## Basic Descriptive Statistics

### Word Counts

In [72]:
# Get total word count
total_word_count = len(total_word_list)
print(f"The total word count: {total_word_count:,}")

The total word count: 132,875


In [73]:
# Get unique word count
unique_word_list = set(total_word_list)
print(f"The unique word count: {len(unique_word_list):,}")

The unique word count: 11,193


### Word Lengths

In [101]:
# Collect the length of every word (used to compute the average below)
word_length = [len(word) for word in total_word_list]

In [103]:
average_word_length = sum(word_length) / total_word_count
print(f"Average word length: {average_word_length} characters")

Average word length: 4.725388523047977 characters


In [108]:
# Get maximum and minimum word lengths
max_word_length = max(word_length)
min_word_length = min(word_length)
print(f"Longest word: {max_word_length} characters")
print(f"Shortest word: {min_word_length} character")

Longest word: 22 characters
Shortest word: 1 character
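A 22-character word is long enough to be worth inspecting, because removing punctuation without inserting a space can fuse neighbouring words (for example `word,word` becomes `wordword`). A minimal sketch for listing the longest entries follows. The sample list stands in for `unique_word_list`, and the helper name `longest_words` is an assumption.

```python
def longest_words(words, top_n=5):
    """Return the top_n longest distinct words, longest first (ties broken alphabetically)."""
    return sorted(set(words), key=lambda w: (-len(w), w))[:top_n]

# Stand-in for unique_word_list, using words from the frequency table
sample_words = ["y", "ya", "sija", "taotao", "ilegña", "ya"]
top_two = longest_words(sample_words, top_n=2)
print(top_two)
```

Running the same helper on the real word list would show whether the longest entries are genuine words or cleaning artifacts.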


### Character Counts

In [80]:
# Count every character across all words
characters = Counter("".join(total_word_list))
total_character_count = sum(characters.values()) # Total character count

In [81]:
print(f"The total number of characters in the text: {total_character_count:,}")

The total number of characters in the text: 627,886


## Frequency Analysis

### Character Frequencies

In [76]:
# Convert dictionary to a list of (character, count) tuples before converting to dataframe
character_list = list(characters.items())

In [78]:
# Convert to dataframe and view results in descending order
character_frequency_df = pd.DataFrame(character_list, columns=["Character", "Frequency"])
# Sort dataframe by frequency
character_frequency_df.sort_values(by="Frequency", ascending=False, inplace=True)
print(character_frequency_df)

   Character  Frequency
2          a     130786
3          n      60208
9          o      47056
6          i      46099
14         e      39671
0          y      38423
12         u      34416
15         g      34305
1          s      30192
4          t      29560
10         j      28684
8          m      27442
7          l      18041
17         p       9827
11         c       9349
13         r       7984
19         d       7472
18         ñ       7411
20         f       6352
5          b       4211
23         ü       4050
16         h       3452
25         q       1023
22         v        870
27         â        646
24         á        160
26         é         85
28         ó         51
21         ú         23
29         í         18
33         x         12
30         ô          2
31         z          2
32         k          2
34         ã          1


### Word Frequencies

In [82]:
# Count how many times each word occurs
words = Counter(total_word_list)

In [92]:
# Convert the word counts to a list of (word, count) tuples
word_frequencies = list(words.items())

In [97]:
# Convert to a dataframe
word_frequencies_df = pd.DataFrame(word_frequencies, columns=["Word", "Frequency"])
word_frequencies_df.sort_values(by="Frequency", ascending=False, inplace=True)
# View top 50 words
print(word_frequencies_df.head(50))

          Word  Frequency
0            y      15477
57          ya       7730
83          na       4631
10         gui       4123
4         sija       3631
9          yan       2958
40          ni       2813
15          si       2354
181       para       1629
41          ti       1487
81          sa       1450
171      güiya       1242
39      taotao       1240
216       anae       1117
50         ayo       1087
129     ilegña       1077
130         nu       1061
52         lao       1020
53       guiya       1013
73        todo        965
153      jamyo        959
183       yuus        848
391       este        824
239        güe        818
137        pot        811
55       jeova        749
5371     jesus        681
119      guajo        668
114        nae        662
17          un        654
185       jago        599
182         as        522
90        jafa        510
97        tano        476
150          o        424
206         yo        411
531      locue        402
352       es

### Lexical Diversity

Compare the number of words used more than once with the number of words used only once in the entire text, to understand how diverse the vocabulary is relative to how much it repeats.

In [2]:
# Words occurring once vs. repeated words
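One way this comparison could be computed is sketched below. It counts words used exactly once against words used repeatedly, and adds the type-token ratio (unique words divided by total words) as a single summary figure. From the counts reported earlier, that ratio for the full text would be 11,193 / 132,875, or about 0.084. The function name `lexical_diversity` and the sample tokens are assumptions; the sample stands in for `total_word_list`.

```python
from collections import Counter

def lexical_diversity(tokens):
    """Summarize how much of the vocabulary is repeated versus used only once."""
    counts = Counter(tokens)
    used_once = sum(1 for count in counts.values() if count == 1)
    return {
        "unique_words": len(counts),
        "used_once": used_once,
        "used_repeatedly": len(counts) - used_once,
        "type_token_ratio": len(counts) / len(tokens) if tokens else 0.0,
    }

# Stand-in for total_word_list
sample_tokens = ["ya", "si", "jesus", "ya", "ilegña", "si", "ya"]
summary = lexical_diversity(sample_tokens)
print(summary)
```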

## N-Grams Analysis

In [112]:
# Get original text and remove numbers
text = re.sub(r"\d+", " ", ch_bible_text)
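The word pairings and three-word phrases named in the headings below can both be counted with one sliding-window helper. This is a minimal sketch using only the standard library. The function name `count_ngrams` is an assumption, and the sample string is made up from words in the frequency table rather than quoted from the text. Note that this version lets n-grams run across sentence boundaries, which may or may not be wanted.

```python
import re
from collections import Counter

def count_ngrams(text, n):
    """Count n-word sequences after lowercasing and stripping punctuation."""
    tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
    # zip over n staggered views of the token list to form the sliding window
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Made-up sample string, not a quotation from the source text
sample_text = "Ya ilegña si Jesus. Ya ilegña si Jesus: ya ilegña."

bigrams = count_ngrams(sample_text, 2)
trigrams = count_ngrams(sample_text, 3)

print(bigrams.most_common(1))
print(trigrams[("ya", "ilegña", "si")])
```

The notebook already imports `nltk.util.ngrams`, which yields the same tuples and could replace the `zip` expression if NLTK is preferred.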

### Find most common word pairings (2 words)

### Find most common phrases (3 words)

# Export Data to Create Training Set

A small dataset of sample pairs will be created manually from the text. Each pair maps an old orthography spelling to its equivalent new orthography spelling, and this dataset will be used to train our machine learning models.

In [116]:
# Export word frequency dataframe to CSV file
word_frequencies_df.to_csv('chamorro_bible_words.csv', index=False, encoding="utf-8")
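Because the training pairs are to be filled in by hand, one possible starting point is an annotation template: the most frequent words plus an empty column for the modern spelling. The sketch below uses the standard `csv` module so it runs on its own. The column name `Modern`, the `top_n` cutoff, and the function name `write_annotation_template` are all assumptions, not part of the original notebook.

```python
import csv
import io

def write_annotation_template(word_counts, handle, top_n=100):
    """Write the top_n most frequent words with an empty column for the modern spelling."""
    writer = csv.writer(handle)
    writer.writerow(["Word", "Frequency", "Modern"])
    ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
    for word, frequency in ranked[:top_n]:
        writer.writerow([word, frequency, ""])  # "Modern" is left blank for manual entry

# Counts copied from the word frequency table above
sample_counts = {"ya": 7730, "y": 15477, "na": 4631}

buffer = io.StringIO()
write_annotation_template(sample_counts, buffer, top_n=2)
template_lines = buffer.getvalue().splitlines()
print(template_lines)
```

Starting from the most frequent words means a small number of hand-made pairs covers a large share of the running text.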

# Train Models

# Evaluate Model Performance

# Final Model Selection

# Export Final Dataset

# Conclusions

# Opportunities for Future Analysis