# Check SP-Morphs in MACULA XML dataset against documented tags

## Table of contents <a class="anchor" id="TOC"></a>(ToC)
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Creating a list of used morph tags in the MACULA N1904 dataset</a>
* <a href="#bullet3">3 - Validate the list of used morph tags against documented tags</a>
   * <a href="#bullet3x1">3.1 - Expand the decoder to report undecodable parts</a>
   * <a href="#bullet3x2">3.2 - Evaluate all tags used in the N1904 MACULA dataset</a>
* <a href="#bullet4">4 - Required libraries</a>
* <a href="#bullet5">5 - Notebook details</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This Jupyter Notebook determines all Morph tags (following the Sandborg-Petersen morphology) used in the MACULA dataset for the Nestle1904 GNT. The results are then compared to [this descriptive document](https://github.com/biblicalhumanities/Nestle1904/blob/master/morph/parsing.txt). The purpose of this notebook is to validate the completeness of the decoder (i.e. that it can successfully decode all Morph tags found in the N1904 GNT LowFat XML dataset provided by [Clear Bible](https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/lowfat)).

Unicode can represent an accented character in two ways: as a precomposed
character (a single code point) or as a decomposed sequence (a base character
followed by one or more combining marks). For instance, ά can be encoded as
the decomposed sequence α (U+03B1) followed by the combining acute accent
(U+0301), as the precomposed character U+03AC (Greek small letter alpha with
tonos), or as the canonically equivalent precomposed character U+1F71 (Greek
small letter alpha with oxia). All of them should be rendered the same way.
However, Python compares strings code point by code point and therefore treats them as different strings. Hence we need a routine which ensures all characters are stored as precomposed Unicode characters (normalization form NFC). This matches our approach in building the N1904-TF dataset.

In [1]:
import unicodedata

def normalizeToPrecomposed(unicodeString):
    # Convert the input string to lowercase and normalize it to its precomposed form (NFC)
    return unicodedata.normalize('NFC', unicodeString.lower())

# Example usage
inputString = "ά"  # Greek small letter alpha and combining acute accent
normalizedString = normalizeToPrecomposed(inputString)

print("Original:", inputString)
print("Normalized:", normalizedString)

Original: ά
Normalized: ά


In [2]:
def displayUnicodeCodePoints(unicodeString):
    # Display each character in a Unicode string with its corresponding code point.
    print("Character\t| Unicode Code Point")
    print("-------------------------------------")
    for char in unicodeString:
        print(f"{char!r:<9}\t| U+{ord(char):04X}")

# Example usage
inputString = "ά-ά"  # Greek small alpha with combining acute accent and precomposed alpha with tonos
displayUnicodeCodePoints(inputString)

Character	| Unicode Code Point
-------------------------------------
'α'      	| U+03B1
'́'      	| U+0301
'-'      	| U+002D
'ά'      	| U+03AC
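
The effect of normalization on string comparison can be checked directly. The following is a minimal sketch, not part of the original notebook; the variable names are chosen for illustration only, and the code points are written as escape sequences so that the editor or clipboard cannot silently change them.

```python
import unicodedata

decomposed = "\u03b1\u0301"  # alpha followed by combining acute accent
tonos = "\u03ac"             # precomposed alpha with tonos
oxia = "\u1f71"              # precomposed alpha with oxia

# Without normalization the strings differ, although they render alike
print(decomposed == tonos)   # False

# After NFC normalization all three collapse into one and the same string
nfcForms = {unicodedata.normalize("NFC", s) for s in (decomposed, tonos, oxia)}
print([f"U+{ord(char):04X}" for form in nfcForms for char in form])  # ['U+03AC']
```

Collecting the normalized forms in a set makes it easy to see that only a single distinct string remains.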


# 2 - Creating a list of used morph tags in the MACULA N1904 dataset<a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

This script processes a collection of XML files from either a local directory or a GitHub repository to extract all unique morphological tags used in the data, while reporting their frequency of use. The tags are collected into a set to ensure uniqueness and then saved as a JSON file for further analysis.

In [3]:
import requests                     # Used in getRateLimit, getFileList (when fetching files from GitHub), and processFile (when downloading XML content from GitHub)
import xml.etree.ElementTree as ET  # Used in processFile for parsing and extracting data from XML files
import re                           # Used in getFileList to match filenames with a specific pattern.
import json                         # Used in main to save morphological tags and frequencies as JSON files
from pathlib import Path            # Used throughout the code for managing file paths (localInputDir, outputDir, morphTagsFile, morphFrequencyFile).

# There are two options to obtain the source data: from the GitHub repository or from a local directory. 
useLocal = True  # Set to False to fetch files from GitHub

# Details of the source location when using a GitHub repository
owner = "tonyjurg"
repo = "Nestle1904LFT"
branch = "main"
path = "resources/xml/20240210"  # Input XML treebank for the Nestle 1904 Greek New Testament
rawBaseUrl = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}/"  # Base URL for raw file content

# Details of the source location when using local XML files
localInputDir = Path("C:/Users/tonyj/OneDrive/Documents/GitHub/REMA-grammarR-playground/XML-input").resolve()

# Directory to save the output files
outputDir = Path("output")
outputDir.mkdir(parents=True, exist_ok=True)

# Path for the JSON file to store morphological tags
morphTagsFile = outputDir / "morph_tags.json"
morphFrequencyFile = outputDir / "morph_tag_frequencies.json"  # New JSON file for tag frequencies

def getRateLimit():
    """
    Fetch and display the current GitHub API rate limit status.
    """
    rateLimitUrl = "https://api.github.com/rate_limit"
    response = requests.get(rateLimitUrl)
    response.raise_for_status()
    rateLimit = response.json()["rate"]
    print(f"GitHub API Rate Limit: {rateLimit['remaining']} remaining out of {rateLimit['limit']} requests.")

def getFileList():
    """
    Get the list of XML files either from the GitHub repository or from the local directory.
    """
    if useLocal:
        if not localInputDir.exists():
            raise FileNotFoundError(f"Local directory {localInputDir} does not exist.")
        return sorted(
            file.name for file in localInputDir.glob("*.xml") if re.match(r"^\d{2}-", file.name)
        )
    else:
        getRateLimit()
        apiUrl = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
        response = requests.get(apiUrl)
        response.raise_for_status()
        files = response.json()
        return sorted(
            file["name"] for file in files if file["name"].endswith(".xml") and re.match(r"^\d{2}-", file["name"])
        )

def processFile(fileName, morphSet, morphFrequency):
    """
    Parse and process the content of a single XML file to extract morphological tags and their frequencies.
    """
    filePath = localInputDir / fileName if useLocal else f"{rawBaseUrl}{fileName}"
    
    if useLocal:
        with filePath.open("rb") as file:
            xmlContent = file.read()
    else:
        response = requests.get(filePath)
        response.raise_for_status()
        xmlContent = response.content

    try:
        root = ET.fromstring(xmlContent)  # Parse XML content from string
    except Exception as e:
        print(f"Error processing {fileName}: {e}")
        return  # Continue with other files

    for word in root.findall(".//w"):
        morph = word.get("morph")
        if morph:
            morphSet.add(morph)
            morphFrequency[morph] = morphFrequency.get(morph, 0) + 1  # Increment frequency count

def main():
    """
    Main script logic to process XML files and extract morphological tags and their frequencies.
    """
    try:
        fileNames = getFileList()
        print(f"Found {len(fileNames)} XML files to process.")
        
        morphSet = set()  # Store unique morphological tags
        morphFrequency = {}  # Store frequencies of morphological tags
        
        for fileName in fileNames:
            try:
                processFile(fileName, morphSet, morphFrequency)
            except Exception as e:
                print(f"Error processing {fileName}: {e}")
        
        # Save the morphological tags to a JSON file
        with morphTagsFile.open("w", encoding="utf-8") as jsonFile:
            json.dump(sorted(morphSet), jsonFile, ensure_ascii=False, indent=4)
        
        # Save the morphological tag frequencies to another JSON file
        with morphFrequencyFile.open("w", encoding="utf-8") as jsonFile:
            json.dump(morphFrequency, jsonFile, ensure_ascii=False, indent=4)
        
        print(f"Saved morphological tags to {morphTagsFile}")
        print(f"Saved morphological tag frequencies to {morphFrequencyFile}")
        print("Processing complete!")
    except Exception as e:
        print(f"Error fetching file list or processing files: {e}")

if __name__ == "__main__":
    main()


Found 27 XML files to process.
Saved morphological tags to output\morph_tags.json
Saved morphological tag frequencies to output\morph_tag_frequencies.json
Processing complete!
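
The core of the extraction (collecting the `morph` attribute of every `w` element and counting it) can be tried on a small in-memory sample without any files. This is only a sketch: the XML snippet below is a hand-made illustration and not a copy of the actual LowFat data, and `collections.Counter` is used here as a compact alternative to the plain dictionary in the script above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made sample; the real LowFat files carry many more attributes
sampleXml = """
<sentence>
  <wg>
    <w morph="N-NSF">βίβλος</w>
    <w morph="N-GSF">γενέσεως</w>
    <w morph="N-GSM">Ἰησοῦ</w>
    <w morph="N-GSM">Χριστοῦ</w>
    <w>untagged</w>
  </wg>
</sentence>
""".strip()

root = ET.fromstring(sampleXml)

# Count every morph tag; <w> elements without a morph attribute are skipped
morphFrequency = Counter(w.get("morph") for w in root.iter("w") if w.get("morph"))

print(sorted(morphFrequency))         # ['N-GSF', 'N-GSM', 'N-NSF']
print(morphFrequency.most_common(1))  # [('N-GSM', 2)]
```

The keys of the counter play the role of `morphSet`, so a separate set is not strictly needed in this variant.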


# 3 - Validate the list of used morph tags against documented tags<a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

This part validates the completeness of the decoder.

## 3.1 - Expand the decoder to report undecodable parts<a class="anchor" id="bullet3x1"></a>
##### [Back to ToC](#TOC)

The following Python code to decode SP Morphs is a modified version of the [SP-Morph-decoder](SP-Morph-decode.py). It uses the same logic, while also reporting any part that could not be decoded.

In [4]:
def decodeTag(tag):
    posMap = {
        "N-": "Noun",
        "A-": "Adjective",
        "T-": "Article",
        "V-": "Verb",
        "P-": "Personal Pronoun",
        "R-": "Relative Pronoun",
        "C-": "Reciprocal Pronoun",
        "D-": "Demonstrative Pronoun",
        "K-": "Correlative Pronoun",
        "I-": "Interrogative Pronoun",
        "X-": "Indefinite Pronoun",
        "Q-": "Correlative/Interrogative Pronoun",
        "F-": "Reflexive Pronoun",
        "S-": "Possessive Pronoun",
        "ADV": "Adverb",
        "CONJ": "Conjunction",
        "COND": "Conditional",
        "PRT": "Particle",
        "PREP": "Preposition",
        "INJ": "Interjection",
        "ARAM": "Aramaic",
        "HEB": "Hebrew",
        "N-PRI": "Proper Noun Indeclinable",
        "A-NUI": "Numeral Indeclinable",
        "N-LI": "Letter Indeclinable",
        "N-OI": "Noun Other Type Indeclinable",
        "PUNCT": "Punctuation"
    }

    caseMap = {
        "N": "Nominative",
        "V": "Vocative",
        "G": "Genitive",
        "D": "Dative",
        "A": "Accusative"
    }

    numberMap = {
        "S": "Singular",
        "P": "Plural"
    }

    genderMap = {
        "M": "Masculine",
        "F": "Feminine",
        "N": "Neuter"
    }

    tenseMap = {
        "P": "Present",
        "I": "Imperfect",
        "F": "Future",
        "2F": "Second Future",
        "A": "Aorist",
        "2A": "Second Aorist",
        "R": "Perfect",
        "2R": "Second Perfect",
        "L": "Pluperfect",
        "2L": "Second Pluperfect",
        "X": "No Tense Stated"
    }

    voiceMap = {
        "A": "Active",
        "M": "Middle",
        "P": "Passive",
        "E": "Middle or Passive",
        "D": "Middle Deponent",
        "O": "Passive Deponent",
        "N": "Middle or Passive Deponent",
        "Q": "Impersonal Active",
        "X": "No Voice"
    }

    moodMap = {
        "I": "Indicative",
        "S": "Subjunctive",
        "O": "Optative",
        "M": "Imperative",
        "N": "Infinitive",
        "P": "Participle",
        "R": "Imperative Participle"
    }

    personMap = {
        "1": "First Person",
        "2": "Second Person",
        "3": "Third Person"
    }

    # Start decoding
    output = {}
    missing = []
    remainingTag = tag.strip()

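    # Decode part of speech (try the longest prefixes first, so that e.g. "N-PRI" is not matched as "N-")
    pos = next((key for key in sorted(posMap, key=len, reverse=True) if remainingTag.startswith(key)), None)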
    if pos:
        output["partOfSpeech"] = posMap[pos]
        remainingTag = remainingTag[len(pos):]
    else:
        missing.append("partOfSpeech")
        return {"undecodedParts": missing}

    # Further decoding
    if pos in ["N-", "A-", "T-"]:
        if len(remainingTag) > 0:
            output["case"] = caseMap.get(remainingTag[0], "Unknown")
        if len(remainingTag) > 1:
            output["number"] = numberMap.get(remainingTag[1], "Unknown")
        if len(remainingTag) > 2:
            output["gender"] = genderMap.get(remainingTag[2], "Unknown")
    elif pos == "V-":
        if len(remainingTag) > 0:
            output["tense"] = tenseMap.get(remainingTag[0], "Unknown")
        if len(remainingTag) > 1:
            output["voice"] = voiceMap.get(remainingTag[1], "Unknown")
        if len(remainingTag) > 2:
            output["mood"] = moodMap.get(remainingTag[2], "Unknown")
        if len(remainingTag) > 3:
            output["person"] = personMap.get(remainingTag[3], "Unknown")
        if len(remainingTag) > 4:
            output["number"] = numberMap.get(remainingTag[4], "Unknown")

    return output
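
Note that `decodeTag` stores the placeholder `"Unknown"` for any character it cannot map, while only a missing part of speech is listed under `undecodedParts`. A stricter validation could also flag those placeholders. The following is a minimal sketch under that assumption; `findUnknownFields` is a hypothetical helper that is not part of the original decoder, and the two dictionaries are hand-written examples in the shape `decodeTag` returns.

```python
def findUnknownFields(decoded):
    """Return the names of all fields whose decoded value is the placeholder 'Unknown'."""
    return sorted(key for key, value in decoded.items() if value == "Unknown")

# Hand-written examples in the shape returned by decodeTag
clean = {"partOfSpeech": "Noun", "case": "Nominative", "number": "Singular", "gender": "Feminine"}
suspect = {"partOfSpeech": "Verb", "tense": "Unknown", "voice": "Active", "mood": "Indicative"}

print(findUnknownFields(clean))    # []
print(findUnknownFields(suspect))  # ['tense']
```

Applying such a helper to every decoded tag would show whether a tag counted as decoded still contains characters that fall outside the documented maps.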


## 3.2 - Evaluate all tags used in the N1904 MACULA dataset<a class="anchor" id="bullet3x2"></a>
##### [Back to ToC](#TOC)

The following code reads the JSON data, runs each Morph tag through the `decodeTag` function to check whether it can be decoded, and reports the results, including any undecodable parts.

The execution of the following cell depends on prior creation of `morph_tag_frequencies.json` and the function `decodeTag(tag)`.

In [5]:
from pathlib import Path
import json

def analyzeTagsAgainstDecoder(jsonFilePath, decodeFunction, outputDir, fullOutput=True):
    """
    Analyze tags in a JSON file against the decodeTag function.

    Args:
        jsonFilePath (str): Path to the JSON file containing tags and their frequencies.
        decodeFunction (function): Function to decode the tags.
        outputDir (Path): Directory to save the output files.
        fullOutput (bool): If True, outputs fully decoded tags. If False, outputs a summary of total decoded tags.

    Returns:
        None: Prints the analysis results and saves them as two JSON files in outputDir.
    """
    # Load JSON data
    with open(jsonFilePath, "r", encoding="utf-8") as file:
        data = json.load(file)

    undecodedTags = {}
    fullyDecodedTags = {}

    # Analyze each tag
    for tag, frequency in data.items():
        result = decodeFunction(tag)
        if "undecodedParts" in result and result["undecodedParts"]:
            # Store undecoded parts and frequency
            undecodedTags[tag] = {
                "frequency": frequency,
                "undecodedParts": result["undecodedParts"]
            }
        else:
            # Store fully decoded tags
            fullyDecodedTags[tag] = frequency

    # Output analysis
    print("\nUndecoded Tags with Issues:")
    print(json.dumps(undecodedTags, indent=4, ensure_ascii=False))

    if fullOutput:
        print("\nFully Decoded Tags:")
        print(json.dumps(fullyDecodedTags, indent=4, ensure_ascii=False))
    else:
        print("\nTotal Number of Fully Decoded Tags:")
        print(len(fullyDecodedTags))

    # Ensure the output directory exists
    outputDir.mkdir(parents=True, exist_ok=True)

    # Compose the file paths
    fullyDecodedPath = outputDir / "fully_decoded_tags.json"
    undecodedPath = outputDir / "undecoded_tags.json"

    # Save fully decoded tags
    with fullyDecodedPath.open("w", encoding="utf-8") as fullFile:
        json.dump(fullyDecodedTags, fullFile, indent=4, ensure_ascii=False)

    # Save undecoded tags
    with undecodedPath.open("w", encoding="utf-8") as undecodedFile:
        json.dump(undecodedTags, undecodedFile, indent=4, ensure_ascii=False)

    print(f"\nAnalysis complete! Results saved to '{fullyDecodedPath}' and '{undecodedPath}'.")


# Define output directory
outputDir = Path("output")

# Call the function and perform the analysis
analyzeTagsAgainstDecoder("output/morph_tag_frequencies.json", decodeTag, outputDir, fullOutput=False)


Undecoded Tags with Issues:
{}

Total Number of Fully Decoded Tags:
1055

Analysis complete! Results saved to 'output\fully_decoded_tags.json' and 'output\undecoded_tags.json'.
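
The result above counts distinct tags. Because the frequency of each tag is also available, coverage can be expressed as a share of word tokens as well. The following is a sketch only: `coverageByTokens` is a hypothetical helper, and the sample dictionaries are invented data that merely follow the structure of `fullyDecodedTags` and `undecodedTags`.

```python
def coverageByTokens(fullyDecoded, undecoded):
    """Return the fraction of word tokens whose tag was decoded (1.0 if there are no tokens)."""
    decodedTokens = sum(fullyDecoded.values())
    undecodedTokens = sum(entry["frequency"] for entry in undecoded.values())
    total = decodedTokens + undecodedTokens
    return decodedTokens / total if total else 1.0

# Invented sample data in the same shape as the dictionaries built above
sampleDecoded = {"N-NSF": 30, "V-PAI-3S": 60}
sampleUndecoded = {"ZZZ": {"frequency": 10, "undecodedParts": ["partOfSpeech"]}}

print(f"{coverageByTokens(sampleDecoded, sampleUndecoded):.1%}")  # 90.0%
```

A token-based figure is helpful when a few undecodable tags turn up: it shows whether they affect rare forms or a large part of the text.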


# 4 - Required libraries<a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

The scripts in this notebook use the following Python libraries:

    json
    pathlib.Path
    re
    requests
    unicodedata
    xml.etree
    
Of these, only `requests` is a third-party package; the others are part of the Python standard library. You can install `requests` from within Jupyter Notebook using either `pip` or `pip3`.
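
Whether a library is available can be checked before running the cells. A minimal sketch, assuming top-level module names; `missingModules` is a hypothetical helper and not part of the notebook.

```python
import importlib.util

def missingModules(moduleNames):
    """Return the modules from moduleNames that cannot be found in the current environment."""
    return [name for name in moduleNames if importlib.util.find_spec(name) is None]

# An empty list means that everything the notebook imports is available
print(missingModules(["json", "pathlib", "re", "requests", "unicodedata", "xml"]))
```

Any name that is printed still needs to be installed in the environment.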

# 5 - Notebook details<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>20 November 2024</td>
    </tr>
  </table>
</div>