# GNT word list in BetaCode from TF

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load TF with N1904addons</a>
* <a href="#bullet3">3 - Get all unique betacode words</a>
* <a href="#bullet4">4 - Store to file</a>
* <a href="#bullet5">5 - Create a JSON dictionary</a>
* <a href="#bullet6">6 - Footnotes and attribution</a>
* <a href="#bullet7">7 - Required libraries</a>
* <a href="#bullet8">8 - Notebook version</a>


# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook uses the Text-Fabric feature [betacode](https://github.com/tonyjurg/N1904addons/blob/main/docs/features/betacode.md) to generate a list of all unique word forms in the Greek New Testament, encoded in Beta Code. This list serves as input to the Morpheus morphological tagger.

# 2 - Load TF with N1904addons <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Loading the Text-Fabric code
from tf.fabric import Fabric
from tf.app import use

In [3]:
# Load the N1904-TF app and data with the additional features
A = use("CenterBLC/N1904", version="1.0.0", mod="tonyjurg/N1904addons/tf/", hoist=globals())

**Locating corpus resources ...**

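| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |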


Display is setup for viewtype [syntax-view](https://github.com/CenterBLC/N1904/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/CenterBLC/N1904/blob/main/docs/viewtypes.md#start) for more information on viewtypes

# 3 - Get all unique betacode words <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

In [8]:
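# Collect every distinct value of the betacode feature on word nodes
betacodeWords = []

# freqList returns (value, frequency) pairs, most frequent first
betacodeFreqList = F.betacode.freqList('word')
for betacode, freq in betacodeFreqList:
    betacodeWords.append(betacode)

# Show the ten most frequent forms
betacodeWords[:10]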

['kai\\',
 'o(',
 'e)n',
 'de\\',
 'tou=',
 'ei)s',
 'to\\',
 'to\\n',
 'th\\n',
 'au)tou=']

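Note that the list display shows `'kai\\'` with a doubled backslash because Python escapes the backslash when it displays a string. The string itself, and therefore the ASCII file, contains a single backslash (`kai\`).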
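The difference between the displayed form and the stored form can be checked with a minimal sketch. The variable name `word` is illustrative and not part of the notebook.

```python
# Illustrative check: the repr of a string doubles the backslash,
# but the string itself holds only one.
word = 'kai\\'        # Beta Code of the first entry in the list above

print(repr(word))     # escaped form, as in the list display
print(word)           # actual content of the string
print(len(word))      # number of characters in the actual content
```

The last line reports four characters (`k`, `a`, `i` and one backslash), which is what ends up in the text file.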

# 4 - Store to file <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

In [2]:
import unicodedata

# Path to the input file
inputFile = 'uniqueWords.txt'

# Function to check if a word uses pre-composed characters
def checkAccentType(word):
    """
    Determine if a word uses pre-composed characters or separate accent definitions.
    
    Args:
        word (str): The Greek word to check.

    Returns:
        str: "precomposed" if the word uses pre-composed characters,
             "separate accents" if it uses separate accent definitions.
    """
    normalizedNFC = unicodedata.normalize('NFC', word)  # Pre-composed form
    normalizedNFD = unicodedata.normalize('NFD', word)  # Decomposed form

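    # Note: a word without any accented characters is identical in both
    # forms and is therefore reported as "precomposed".
    if word == normalizedNFC:
        return "precomposed"
    elif word == normalizedNFD:
        return "separate accents"
    else:
        return "mixed"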

# Read Greek words from the input file
with open(inputFile, 'r', encoding='utf-8') as inFile:
    greekWords = inFile.read().splitlines()

# Analyze each word for accent storage
accentAnalysis = {word: checkAccentType(word) for word in greekWords}

# Print results
precomposedCount = sum(1 for v in accentAnalysis.values() if v == "precomposed")
separateAccentsCount = sum(1 for v in accentAnalysis.values() if v == "separate accents")
mixedCount = sum(1 for v in accentAnalysis.values() if v == "mixed")

print(f"Precomposed: {precomposedCount}")
print(f"Separate accents: {separateAccentsCount}")
print(f"Mixed: {mixedCount}")

# Save the results to a file
outputFile = 'accentAnalysis.json'
import json
with open(outputFile, 'w', encoding='utf-8') as outFile:
    json.dump(accentAnalysis, outFile, ensure_ascii=False, indent=4)

print(f"Accent analysis saved to {outputFile}.")

Precomposed: 19477
Separate accents: 0
Mixed: 0
Accent analysis saved to accentAnalysis.json.
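
To illustrate what this check distinguishes, here is a minimal sketch comparing the two Unicode normalization forms for a single accented letter. The sample characters are chosen for illustration only and are not read from `uniqueWords.txt`.

```python
import unicodedata

precomposed = "\u03ac"        # GREEK SMALL LETTER ALPHA WITH TONOS (one code point)
decomposed = "\u03b1\u0301"   # alpha followed by COMBINING ACUTE ACCENT (two code points)
plain = "\u03b1"              # alpha without any accent

# Both strings render the same, but differ in length
print(len(precomposed), len(decomposed))

# NFC composes, NFD decomposes
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# An unaccented string is unchanged by both normalizations
print(unicodedata.normalize("NFC", plain) == unicodedata.normalize("NFD", plain) == plain)
```

A result of zero "separate accents" and zero "mixed" words therefore suggests that every accented character in the word list is stored as a single precomposed code point.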


In [9]:
output_path = "gnt_words.txt"  

with open(output_path, "w", encoding="utf-8") as out_file:
    # join with newline characters and end with a final newline
    out_file.write("\n".join(betacodeWords) + "\n")

print(f"Wrote {len(betacodeWords)} words to {output_path}")

Wrote 19446 words to gnt_words.txt


# 5 - Create a JSON dictionary <a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

The following script creates a JSON file in which the Beta Code representations are the keys and the corresponding Greek Unicode words are the values. This dictionary helps to translate the results of the Morpheus lookup back to Greek. The same can now be done in other ways, for example with the newly created TF feature [betacode](https://github.com/tonyjurg/N1904addons/blob/main/docs/features/betacode.md) or on the fly with the [beta_code-py library](https://github.com/perseids-tools/beta-code-py).

In [4]:
import beta_code
import json

def capitalizeIfAllCaps(word):
    if word.isupper():  # Check if the word is all uppercase
        return word.capitalize()  # Capitalize only the first letter
    return word  # Leave the word unchanged if it's not all uppercase

# Paths to input and output files
inputFile = 'uniqueWords.txt'       # File containing Greek Unicode words
outputFile = 'betaCodeToWord.json'  # File to save the Beta Code-to-Greek mapping

# Read Greek words from the input file
with open(inputFile, 'r', encoding='utf-8') as inFile:
    greekWords = inFile.read().splitlines()

# Create a dictionary with Beta Code as keys and Greek words as values
wordsBetaCodeMap = {beta_code.greek_to_beta_code(capitalizeIfAllCaps(word)): word for word in greekWords}

# Write the dictionary to a JSON file
with open(outputFile, 'w', encoding='utf-8') as outFile:
    json.dump(wordsBetaCodeMap, outFile, ensure_ascii=False, indent=4)

print(f"Created JSON file with {len(wordsBetaCodeMap)} entries: {outputFile}")


Created JSON file with 19477 entries: betaCodeToWord.json
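
As a hedged usage sketch, the resulting file could be loaded and queried as shown below. To stay self-contained, the example writes a small sample file in the same shape (Beta Code as key, Greek Unicode as value) to a temporary directory instead of reading the real `betaCodeToWord.json`. The sample pairs, the file name `sampleBetaCodeToWord.json` and the helper `lookupGreek` are assumptions made for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample in the same shape as betaCodeToWord.json
sampleMap = {"e)n": "ἐν", "tou=": "τοῦ", "ei)s": "εἰς"}

with tempfile.TemporaryDirectory() as tmpDir:
    samplePath = os.path.join(tmpDir, "sampleBetaCodeToWord.json")

    # Write the sample the same way the notebook writes the real file
    with open(samplePath, "w", encoding="utf-8") as outFile:
        json.dump(sampleMap, outFile, ensure_ascii=False, indent=4)

    # Load it back, as one would do with the real file
    with open(samplePath, "r", encoding="utf-8") as inFile:
        betaCodeToWord = json.load(inFile)


def lookupGreek(betaCodeForm, mapping):
    """Return the Greek Unicode word for a Beta Code string, or None if unknown."""
    return mapping.get(betaCodeForm)


print(lookupGreek("tou=", betaCodeToWord))    # a form present in the sample
print(lookupGreek("lo/gos", betaCodeToWord))  # a form absent from the sample
```

Using `dict.get` returns `None` for unknown forms instead of raising an exception, which is convenient when a Morpheus result contains a form that is not in the word list.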


# 6 - Footnotes and attribution<a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

TBD

# 7 - Required libraries<a class="anchor" id="bullet7"></a>
##### [Back to ToC](#TOC)

The scripts in this notebook require the following Python libraries to be installed in the environment:

    beta_code
    json (standard library)
    tf (Text-Fabric)
    unicodedata (standard library)

You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.

# 8 - Notebook version<a class="anchor" id="bullet8"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>30 April 2025</td>
    </tr>
  </table>
</div>