# Crosscheck N1904-TF and Morpheus based SP tags

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Mathematical concepts</a>
* <a href="#bullet3">3 - Create set $S$ (SP tags in N1904-TF)</a>
* <a href="#bullet4">4 - Create set $M$ (Morpheus derived SP tags)</a>
* <a href="#bullet7">7 - Validating $M \subseteq S$ and $S \subseteq M$ with a set explorer</a>
* <a href="#bullet9">9 - Required libraries</a>
* <a href="#bullet10">10 - Notebook version details</a>

#  1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This notebook performs a crosscheck between the Sandborg-Petersen (SP) tags generated by the [Morphkit package](https://tonyjurg.github.io/morphkit/) and the reference mappings from the N1904-TF dataset.

The goal is to identify missing or incorrect mappings that may arise when SP-tags are determined from Morpheus output. In this comparison, the N1904-TF dataset is treated as a large and authoritative set of valid SP-tag assignments. Morpheus may legitimately produce additional tags that are not found in the N1904-TF dataset, but each of those additional tags should be reviewed.

# 2 - Mathematical concepts <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

The test to be performed can be described mathematically by examining the relationship between two distinct sets which we will create in this Jupyter Notebook:

Set | Description
---|---
$M$ | contains all possible `SP-tags` constructed from the output of Morpheus.
$S$ | contains all possible `SP-tags` found in the source XML (N1904-TF / MACULA data).

Using these two distinct sets we can check (validate) the following two mappings:

**N1904-TF → Morpheus**  

This test verifies whether all SP-tags found in the Greek New Testament (set $S$) are also present in the tag list derived from the Morpheus output (set $M$). To formalize this, we define a mapping $f: S \to M$ that sends each SP-tag from the GNT source XML (set $S$) to the identical Morpheus-derived tag (set $M$), so $f(s) = s$. Such a mapping exists for every element of $S$ exactly when $S \subseteq M$; that is, every SP-tag in $S$ should also be found in $M$. If this condition is not met, further investigation is needed to understand and explain why certain mappings are missing.

**Morpheus → N1904-TF**  

This part checks which tags constructed from the Morpheus data are actually NOT found in the Greek New Testament (GNT). In principle, it is quite possible that valid morphological tags could be derived from Morpheus that are not attested in the GNT.

Formally, we could define a restriction of $M$ to the forms relevant for the GNT, expressed as:
$$M' = \{ m \in M \mid \text{form}(m) \in \text{FormsInNT} \}$$
However, introducing this restriction would increase complexity (for example, by requiring a definition of $\text{FormsInNT}$) without significantly improving the validation process, since the resulting set is expected to be small and the additional condition risks overlooking relevant cases.

Therefore, a simpler approach is to directly check whether $M \subseteq S$. In fact, we expect that $M \not\subseteq S$.

The difference set $M \setminus S$ collects the Morpheus-derived SP-tags in $M$ that are not attested in the GNT (set $S$). The resulting set can then be analysed further.
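
Both conditions map directly onto Python's built-in set operators. The following is a minimal sketch on invented toy tags (the values are assumptions for illustration, not results from N1904-TF or Morpheus):

```python
# Toy data, assumed for illustration only
S = {"N-NSF", "N-GSM", "V-AAI-3S", "CONJ"}
M = {"N-NSF", "N-GSM", "V-AAI-3S", "N-VSM"}

# Condition one: is every tag in S also in M?
s_subset_of_m = S <= M
missing_in_m = S - M   # tags in S that the Morpheus-derived set does not cover

# Condition two: is every tag in M also in S? (expected to be False)
m_subset_of_s = M <= S
extra_in_m = M - S     # the difference set M \ S

print("S subset of M:", s_subset_of_m, "| missing:", sorted(missing_in_m))
print("M subset of S:", m_subset_of_s, "| extra:", sorted(extra_in_m))
```

In this toy case neither inclusion holds: `CONJ` would need investigation under condition one, while `N-VSM` is a candidate for review under condition two.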

## 2.1 - Load N1904-TF with N1904Addons<a class="anchor" id="bullet2x1"></a>

We load the N1904-TF dataset together with the additional dataset [N1904Addons](https://tonyjurg.github.io/N1904addons), because the add-on supplies the Morpheus-derived morph tags that we want to test.

In [2]:
# Load the autoreload extension to automatically reload modules before executing code
%load_ext autoreload
%autoreload 2

In [3]:
# Import required modules, including Text-Fabric (tf)
from tf.app import use
import os
import sys

In [4]:
# Load the N1904-TF app and data
A = use("CenterBLC/N1904", version="1.0.0", mod="tonyjurg/N1904addons/tf/", hoist=globals())

**Locating corpus resources ...**

Name | # of nodes | # slots / node | % coverage
---|---|---|---
book | 27 | 5102.93 | 100
chapter | 260 | 529.92 | 100
verse | 7944 | 17.34 | 100
sentence | 8011 | 17.2 | 100
group | 8945 | 7.01 | 46
clause | 42506 | 8.36 | 258
wg | 106868 | 6.88 | 533
phrase | 69007 | 1.9 | 95
subphrase | 116178 | 1.6 | 135
word | 137779 | 1.0 | 100


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

# 3 - Create set $S$ (SP tags in N1904-TF) <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

The first step is to create a list of unique SP morphological tags in the New Testament. This list is obtained from the `morph` feature of N1904-TF, which originates from the MACULA XML source data.

In [26]:
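from collections import Counter

# Count how often each SP morph tag occurs across all word nodes
morphFrequency = Counter()
for wordNode in F.otype.s("word"):
    morph = F.morph.v(wordNode)
    if morph:  # skip words without a morph value
        morphFrequency[morph] += 1

# The Counter keys are the unique tags (set S, kept as a list here)
N1904_morphs = list(morphFrequency)
print("First 10 unique morphs:", N1904_morphs[:10])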

First 10 unique morphs: ['N-NSF', 'N-GSF', 'N-GSM', 'N-PRI', 'V-AAI-3S', 'T-ASM', 'CONJ', 'N-ASM', 'T-APM', 'N-APM']
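
Because the tags are accumulated in a `Counter`, the frequency of each tag is available alongside the unique tags. The sketch below uses an invented list, `toy_morphs`, as a stand-in for the values returned by `F.morph.v(wordNode)`; the names and data are hypothetical:

```python
from collections import Counter

# Hypothetical per-word morph values; None stands for a word without a tag
toy_morphs = ["N-NSF", "CONJ", "N-NSF", None, "V-AAI-3S", "CONJ", "CONJ"]

# Count only non-empty values, mirroring the `if morph:` check
toy_frequency = Counter(m for m in toy_morphs if m)

unique_tags = set(toy_frequency)        # the unique tags as a set
top_two = toy_frequency.most_common(2)  # the two most frequent tags with counts

print(len(unique_tags), top_two)
```

Keeping the counts can help later on: a tag that is missing from the other set but occurs only once deserves a different kind of scrutiny than one that occurs thousands of times.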


#  4 - Create set $M$ (Morpheus derived SP tags) <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

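The following Python script walks through all word nodes and, for each word, through the Morpheus analysis blocks stored in the features `ms1_lemma`/`ms1_morph` up to `ms8_lemma`/`ms8_morph`. Blocks without a lemma or without a morph value are skipped. Each morph value is split on `/`, and every resulting Sandborg-Petersen tag is counted; the unique tags form set $M$.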

In [27]:
from collections import Counter
import pandas as pd

# Accumulate all Morpheus-derived tags into a single Counter
morphFrequency = Counter()

# Iterate over all word nodes and their (up to eight) analysis blocks
for wordNode in F.otype.s("word"):
    for blockNumber in range(1, 9):
        lemma = Fs(f"ms{blockNumber}_lemma").v(wordNode)
        if not lemma:
            continue
        morph_string = Fs(f"ms{blockNumber}_morph").v(wordNode)
        if not morph_string:
            continue

        for tag in morph_string.split("/"):
            tag = tag.strip()
            if tag:
                morphFrequency[tag] += 1

Morpheus_morphs = list(morphFrequency)
print("First 10 unique morphs:", Morpheus_morphs[:10])

First 10 unique morphs: ['N-NSM', 'N-NSF', 'N-GSF', 'N-GSM', 'N-VSM', 'N-PRI', 'N-GSN', 'V-IEI-2S', 'V-IAI-3S', 'V-PEM-2S']
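
The inner loop of the script treats a single morph value as a list of alternative tags separated by `/`. The sketch below isolates that step in a hypothetical helper, `splitTags`, and applies it to invented strings (not taken from the N1904Addons data):

```python
def splitTags(morphString):
    """Split a '/'-separated morph string into stripped, non-empty tags."""
    if not morphString:
        return []
    return [tag.strip() for tag in morphString.split("/") if tag.strip()]

# Invented example values
print(splitTags("N-NSM / N-VSM"))
print(splitTags("V-PAI-3S"))
print(splitTags(None))
```

Stripping whitespace and dropping empty fragments prevents variants such as `"N-NSM "` or an empty string from being counted as separate tags.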


# 7 - Validating $M \subseteq S$ and $S \subseteq M$ with a set explorer <a class="anchor" id="bullet7"></a>
##### [Back to ToC](#TOC)

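The following code makes it possible to compare the sets $M$ and $S$ interactively. It prints the size of each set, of their intersection and of both difference sets, and then renders the tags of each group as collapsible HTML lists, grouped by the first two hyphen-separated parts of the tag.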

In [28]:
import json
from IPython.display import display, HTML

def loadTagsFromFile(filePath):
    with open(filePath, "r", encoding="utf-8") as f:
        tags = {line.strip() for line in f if line.strip()}
    return tags

def groupTags(tags):
    grouped = {}
    for tag in tags:
        parts = tag.split("-")
        if len(parts) < 2:
            primary = parts[0]
            secondary = ""
        else:
            primary, secondary = parts[0], parts[1]

        if primary not in grouped:
            grouped[primary] = {}

        if secondary not in grouped[primary]:
            grouped[primary][secondary] = []

        grouped[primary][secondary].append(tag)

    return grouped

def displayGroupedTags(groupedTags, title="Grouped Tags"):
    html = f"<h3>{title}</h3>"

    for primary in sorted(groupedTags.keys()):
        secondaryGroups = groupedTags[primary]
        
        totalPrimary = sum(len(tags) for tags in secondaryGroups.values())

        html += f"<details><summary><b>Primary: {primary}</b> ({totalPrimary} tags)</summary>"

        for secondary in sorted(secondaryGroups.keys()):
            tagsInSecondary = secondaryGroups[secondary]
            html += f"<details style='margin-left:20px'><summary>Secondary: {secondary} ({len(tagsInSecondary)} tags)</summary>"
            html += "<ul>"
            for tag in sorted(tagsInSecondary):
                html += f"<li>{tag}</li>"
            html += "</ul></details>"

        html += "</details>"

    display(HTML(html))

def exploreSets(setA, setB, labelA="Set A", labelB="Set B"):
    onlyInA = setA - setB
    onlyInB = setB - setA
    inBoth = setA & setB

    print(f"{labelA}: {len(setA)} tags")
    print(f"{labelB}: {len(setB)} tags")
    print(f"Shared: {len(inBoth)} tags")
    print(f"Only in {labelA}: {len(onlyInA)} tags")
    print(f"Only in {labelB}: {len(onlyInB)} tags")

    if onlyInA:
        groupedOnlyA = groupTags(onlyInA)
        displayGroupedTags(groupedOnlyA, title=f"Tags only in {labelA}")

    if onlyInB:
        groupedOnlyB = groupTags(onlyInB)
        displayGroupedTags(groupedOnlyB, title=f"Tags only in {labelB}")

    if inBoth:
        groupedInBoth = groupTags(inBoth)
        displayGroupedTags(groupedInBoth, title=f"Tags in both {labelA} and {labelB}")

if __name__ == "__main__":
    # Optional: load the sets from text files instead of the lists built above
    # setSFile = "validation/setSTags.txt"  # SP tags from MACULA XML
    # setMFile = "validation/setMTags.txt"  # SP tags derived from Morpheus
    # S = loadTagsFromFile(setSFile)
    # M = loadTagsFromFile(setMFile)

    # Compare the in-memory tag lists created in sections 3 and 4
    exploreSets(
        set(N1904_morphs), 
        set(Morpheus_morphs), 
        labelA="Source N1904-TF (S)", 
        labelB="Morpheus derived (M)"
    )


Source N1904-TF (S): 1055 tags
Morpheus derived (M): 1010 tags
Shared: 500 tags
Only in Source N1904-TF (S): 555 tags
Only in Morpheus derived (M): 510 tags
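
With several hundred tags in each difference set, a count per primary part (the text before the first hyphen) can help decide where to start the investigation. The following is a sketch on invented tags; the helper `countByPrimary` is hypothetical and not part of the notebook:

```python
from collections import Counter

def countByPrimary(tags):
    """Count tags per primary part, i.e. the text before the first hyphen."""
    return Counter(tag.split("-")[0] for tag in tags)

# Invented difference set for illustration
onlyInM = {"N-VSM", "N-GSN", "V-IEI-2S", "V-PEM-2S", "V-IAI-3S", "ADV"}

summary = countByPrimary(onlyInM)
for primary, count in summary.most_common():
    print(primary, count)
```

In the notebook itself, the same helper could presumably be applied to `set(Morpheus_morphs) - set(N1904_morphs)` and to the reverse difference.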


# 9 - Required libraries <a class="anchor" id="bullet9"></a>
##### [Back to ToC](#TOC)

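The scripts in this notebook require Python 3.8+ and the following libraries to be available in the environment:

``` python
    collections
    IPython
    json
    os
    pandas
    sys
    tf (Text-Fabric)
```

The modules `collections`, `json`, `os` and `sys` are part of the Python standard library. You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`; Text-Fabric is published on PyPI as `text-fabric`.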
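
To find out which modules are missing before running the notebook, a quick check with the standard library's `importlib` can be used. This is a minimal sketch; the list `requiredModules` covers only a few of the modules imported in this notebook and can be extended as needed:

```python
import importlib.util

# Module names as they are imported (not necessarily the pip package names)
requiredModules = ["IPython", "json", "os"]

# find_spec returns None when a top-level module cannot be located
missing = [name for name in requiredModules
           if importlib.util.find_spec(name) is None]

print("Missing modules:", missing or "none")
```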

# 10 - Notebook version details<a class="anchor" id="bullet10"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.2</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>29 April 2025</td>
    </tr>
  </table>
</div>