# Match N1904-TF word nodes with Morpheus analysis

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load TF with N1904addons</a>
* <a href="#bullet3x1">3 - Match all word nodes</a>
* <a href="#bullet3">4 - Acknowledgements</a>
* <a href="#bullet4">5 - Required libraries</a>
* <a href="#bullet5">6 - Notebook version details</a>


# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook uses the `betacode` feature to obtain the Beta Code form of every word in the Greek New Testament and matches these forms against the output of the Morpheus morphological tagger. The goal is a JSON file that holds, for each word node in N1904-TF, an ordered list of the possible morphological interpretations of its surface form. The interpretation that most closely matches the morphology assigned in the N1904-TF dataset is ranked first, followed by the less likely ones in descending order of score.

# 2 - Load TF with N1904addons <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

In [1]:
# Load the autoreload extension to automatically reload modules before executing code
%load_ext autoreload
%autoreload 2

In [2]:
# Loading the Text-Fabric code
from tf.fabric import Fabric
from tf.app import use

In [3]:
# Load the N1904-TF app and data with the additional features
A = use("CenterBLC/N1904", version="1.0.0", mod="tonyjurg/N1904addons/tf/", silence="terse", hoist=globals())

**Locating corpus resources ...**

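| Name | # of nodes | # slots / node | % coverage |
|------|------------|----------------|------------|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |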


Display is setup for viewtype [syntax-view](https://github.com/CenterBLC/N1904/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/CenterBLC/N1904/blob/main/docs/viewtypes.md#start) for more information on viewtypes

# 3 - Match all word nodes <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

## 3.1 - Load the parsed morpheus data <a class="anchor" id="bullet3x1"></a>

This section loads the `morpheus_parses.json` file into variable `morph_data`.

In [4]:
import json
import pathlib

parsed_json_path = pathlib.Path("morpheus_parses.json")

if not parsed_json_path.exists():
    raise FileNotFoundError(f"{parsed_json_path} not found; please generate it first.")

with parsed_json_path.open(encoding="utf-8") as f:
    morph_data = json.load(f)

print(f"Morpheus dataset loaded: {len(morph_data):,} word forms")

Morpheus dataset loaded: 19,446 word forms


## 3.2 - Calculate similarity score <a class="anchor" id="bullet3x2"></a>

This section defines `score_parse`, which scores a Morpheus parse against the TF attributes of a word node. Since our TF dataset uses the full names of the grammatical properties while Morpheus abbreviates them, we need to map the TF values to their Morpheus abbreviations before comparing.

*The weight factors are still open for experimentation!*

The ranking function assigns points as follows:

| Feature matched | Points |
|-----------------|--------|
| case, number, gender | +2 each |
| tense, mood, voice   | +2 each |
| person               | +1 |
| POS label            | +1 |
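
As a worked illustration of the table, the sketch below tallies the points for two hypothetical cases. The `WEIGHTS` dict and the `tally` helper are names made up for this example; they are not part of the notebook's own code.

```python
# Hypothetical helper illustrating the points table above
WEIGHTS = {
    "case": 2, "number": 2, "gender": 2,
    "tense": 2, "mood": 2, "voice": 2,
    "person": 1, "POS": 1,
}

def tally(matched_features):
    """Sum the points for the features that matched."""
    return sum(WEIGHTS[name] for name in matched_features)

# A finite verb whose tense, mood, voice, number, person and POS all agree
verb_score = tally(["tense", "mood", "voice", "number", "person", "POS"])
print(verb_score)  # 2 + 2 + 2 + 2 + 1 + 1 = 10

# A noun whose case, number, gender and POS all agree
noun_score = tally(["case", "number", "gender", "POS"])
print(noun_score)  # 2 + 2 + 2 + 1 = 7
```

Because verbs and nouns carry different sets of features, their maximum scores differ, so scores are only comparable between parses of the same word node.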

In [5]:
# MAP legend: {"TF": "Morpheus"}
CASE_MAP   = {"nominative":"nom","genitive":"gen","dative":"dat",
              "accusative":"acc","vocative":"voc"}
NUMBER_MAP = {"singular":"sg","plural":"pl","dual":"du"}
GENDER_MAP = {"masculine":"masc","feminine":"fem","neuter":"neut"}
TENSE_MAP  = {"present":"pres","imperfect":"imperf","future":"fut",
              "aorist":"aor","perfect":"perf","pluperfect":"plup"}
MOOD_MAP   = {"indicative":"ind","subjunctive":"subj","optative":"opt",
              "imperative":"imperat","infinitive":"inf","participle":"part"}
VOICE_MAP  = {"active":"act","middle":"mid","passive":"pass",
              "middlepassive":"mid/pass","middle/passive":"mid/pass",
              "mid/pass":"mid/pass","act/pass":"act/pass"}
PERSON_MAP = {"p1":"1st","p2":"2nd","p3":"3rd"}
SP_MAP     = {"subs":"noun","noun":"noun","adj":"adjective","adv":"adverb",
              "prep":"prep","conj":"conj","prn":"pron",
              "art":"art","part":"particle","verb":"verb"}

def _abbr(val, mapping):
    # Return the Morpheus abbreviation for a TF value (or the lower-cased value itself if unmapped)
    if val is None:
        return None
    return mapping.get(val.lower(), val.lower())

def normalise_tf_attrs(tf_attrs):
    # Convert a TF-attribute dict to Morpheus-style abbreviations
    return {
        "case"  : _abbr(tf_attrs.get("case"),   CASE_MAP),
        "number": _abbr(tf_attrs.get("number"), NUMBER_MAP),
        "gender": _abbr(tf_attrs.get("gender"), GENDER_MAP),
        "tense" : _abbr(tf_attrs.get("tense"),  TENSE_MAP),
        "mood"  : _abbr(tf_attrs.get("mood"),   MOOD_MAP),
        "voice" : _abbr(tf_attrs.get("voice"),  VOICE_MAP),
        "person": _abbr(tf_attrs.get("person"), PERSON_MAP),
        "sp"    : _abbr(tf_attrs.get("sp"),     SP_MAP),
    }

def score_parse(parse_dict, tf_attrs, debug=False):
    """
    Return an integer score for how well `parse_dict` matches the TF attributes.
    +2 for each exact match of a core feature; +1 for person and POS.
    These weights are provisional and open to experimentation.
    """
    tf = normalise_tf_attrs(tf_attrs)
    score = 0

    for key in ("case", "number", "gender", "tense", "mood", "voice"):
        if tf.get(key) and parse_dict.get(key) == tf[key]:
            score += 2
    if tf.get("person") and parse_dict.get("person") == tf["person"]:
        score += 1
    if tf.get("sp") and parse_dict.get("POS") == tf["sp"]:
        score += 1

    if debug and score == 0:
        print("No match found for")
        print("  Parse:", parse_dict)
        print("  TF   :", tf_attrs)
        print("-" * 40)

    return score
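
The sample output in section 3.5 shows Morpheus parses carrying full names for some features (`'tense': 'aorist'`) and abbreviations for others (`'case': 'acc'`). One possible way to make the comparison independent of that mix is to pass both sides through the same lookup table before comparing. The sketch below illustrates the idea with a cut-down table; `CANON`, `canon` and `same_value` are hypothetical names, and this is not the notebook's `score_parse`.

```python
# Cut-down lookup table: full name -> abbreviation (illustrative subset only)
CANON = {
    "accusative": "acc", "nominative": "nom",
    "singular": "sg", "plural": "pl",
    "aorist": "aor", "subjunctive": "subj", "active": "act",
}

def canon(value):
    """Map a full name to its abbreviation; leave abbreviations untouched."""
    if value is None:
        return None
    value = value.lower()
    return CANON.get(value, value)

def same_value(tf_value, parse_value):
    """True when both values denote the same category, whatever their spelling."""
    return tf_value is not None and canon(tf_value) == canon(parse_value)

print(same_value("accusative", "acc"))   # True: full name vs abbreviation
print(same_value("aorist", "aorist"))    # True: full name on both sides
print(same_value("singular", "pl"))      # False: different categories
print(same_value(None, "acc"))           # False: TF has no value for this feature
```

Whether such a symmetric comparison is appropriate depends on how the parsed Morpheus data was produced, so it is best checked against a few known word forms first.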

## 3.3 - Sort result based on score <a class="anchor" id="bullet3x3"></a>

This part ranks all parses for every word node and stores them (with their scores) in a dictionary.

In [6]:
from tqdm.auto import tqdm

ranked_parses = {}          # node → list of parse dicts, each with a score
missing_beta = []           # betacode of every word node without a Morpheus entry

word_nodes = list(F.otype.s("word"))
for w in tqdm(word_nodes, desc="ranking parses"):
    beta = F.betacode.v(w)
    parses = morph_data.get(beta)
    if not parses:
        missing_beta.append(beta)
        continue

    tf_attrs = dict(
        case   = F.case.v(w),
        number = F.number.v(w),
        gender = F.gender.v(w),
        tense  = F.tense.v(w),
        mood   = F.mood.v(w),
        voice  = F.voice.v(w),
        person = F.person.v(w),
        sp     = F.sp.v(w),
    )

    # Attach a score to a copy of every parse
    ranked = []
    for p in parses:
        p_copy = p.copy()
        p_copy["score"] = score_parse(p, tf_attrs, debug=False)
        ranked.append(p_copy)

    # Sort from best to worst (highest score first); the sort is stable,
    # so parses with equal scores keep their original order
    ranked.sort(key=lambda d: d["score"], reverse=True)
    ranked_parses[w] = ranked


print(f"Ranked parses ready for {len(ranked_parses):,} / {len(word_nodes):,} word nodes.")
print(f"{len(missing_beta):,} word forms had no Morpheus entry.")

Ranked parses ready for 135,358 / 137,779 word nodes.
2,421 word forms had no Morpheus entry.
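
The tie-breaker mentioned in the sorting comment relies on Python's sort being stable, also with `reverse=True`: parses with equal scores keep their original relative order. A minimal illustration with made-up parse dicts (the `id` key is hypothetical and only serves to show the order):

```python
# Made-up parses; "id" records the original position
parses = [
    {"id": "a", "score": 1},
    {"id": "b", "score": 3},
    {"id": "c", "score": 1},
    {"id": "d", "score": 3},
]

# Highest score first; ties stay in their original order
parses.sort(key=lambda d: d["score"], reverse=True)
order = [d["id"] for d in parses]
print(order)  # ['b', 'd', 'a', 'c']
```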


## 3.4 - Export to JSON <a class="anchor" id="bullet3x4"></a>

Now export the built dictionary to `word2parse_ranked.json`.

In [7]:
out_json = "word2parse_ranked.json"
with open(out_json, "w", encoding="utf-8") as f:
    json.dump(ranked_parses, f, ensure_ascii=False, indent=2)

print(f"Saved ranked parse dictionary to {out_json}")

Saved ranked parse dictionary to word2parse_ranked.json
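
JSON object keys are always strings, so the integer node numbers used as keys in `ranked_parses` come back as strings when the file is loaded again. A minimal round-trip sketch, using an in-memory string and a made-up one-entry dictionary instead of the real file:

```python
import json

# Made-up miniature of the ranked dictionary: node number -> list of parses
ranked = {29796: [{"lemma_unicode": "πίνω", "score": 7}]}

text = json.dumps(ranked, ensure_ascii=False)
loaded = json.loads(text)
print(list(loaded))        # ['29796'] : the key is now a string

# Convert the keys back to integers before using them as TF node numbers
restored = {int(node): parses for node, parses in loaded.items()}
print(restored == ranked)  # True
```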


## 3.5 - Dump random sample <a class="anchor" id="bullet3x5"></a>

Rerun the cell below to get a different sample.

In [12]:
import random, pprint
sample_nodes = random.sample(list(ranked_parses.keys()), 5)
for n in sample_nodes:
    A.webLink(n)
    print(f"Node {n} | {F.trans.v(n)}  |  {F.text.v(n)} |  {F.betacode.v(n)}")
    pprint.pp(ranked_parses[n][:5])   # show the top 5 parses
    print("-" * 40)

Node 29796 | shall he drink  |  πίῃ |  pi/h|
[{'raw_beta': 'pi/h|',
  'work_beta': 'pi/h|',
  'work_unicode': 'πίῃ',
  'lemma_beta': 'pi/nw',
  'lemma_base': 'pi/nw',
  'lemma_unicode': 'πίνω',
  'lemma_base_unicode': 'πίνω',
  'stem_beta': 'pi',
  'stem_unicode': 'πι',
  'morph_codes': ['aor2', 'aor2'],
  'ending_beta': 'h|',
  'ending_unicode': 'ῃ',
  'tense': 'aorist',
  'mood': 'subjunctive',
  'voice': 'active',
  'other_end_tokens': ['3rd'],
  'number': 'sg',
  'raw_unicode': 'πίῃ',
  'POS': 'verb',
  'score': 7},
 {'raw_beta': 'pi/h|',
  'work_beta': 'pi/h|',
  'work_unicode': 'πίῃ',
  'lemma_beta': 'pi/nw',
  'lemma_base': 'pi/nw',
  'lemma_unicode': 'πίνω',
  'lemma_base_unicode': 'πίνω',
  'stem_beta': 'pi',
  'stem_unicode': 'πι',
  'morph_codes': ['aor2', 'aor2'],
  'ending_beta': 'h|',
  'ending_unicode': 'ῃ',
  'tense': 'aorist',
  'mood': 'subjunctive',
  'voice': 'middle/passive',
  'other_end_tokens': ['2nd'],
  'number': 'sg',
  'raw_unicode': 'πίῃ',
  'POS': 'verb',


Node 50982 | Him  |  αὐτὸν |  au)to\n
[{'raw_beta': 'au)to\\n',
  'work_beta': 'au)to/n',
  'work_unicode': 'αὐτόν',
  'lemma_beta': 'au)to/s',
  'lemma_base': 'au)to/s',
  'lemma_unicode': 'αὐτός',
  'lemma_base_unicode': 'αὐτός',
  'stem_beta': 'au)t',
  'stem_unicode': 'αὐτ',
  'morph_codes': ['art_adj', 'art_adj'],
  'ending_beta': 'on',
  'ending_unicode': 'ον',
  'gender': 'masc',
  'case': 'acc',
  'number': 'sg',
  'raw_unicode': 'αὐτὸν',
  'POS': 'noun',
  'score': 0}]
----------------------------------------


Node 115236 | useless  |  ἄχρηστον |  a)/xrhston
[{'raw_beta': 'a)/xrhston',
  'work_beta': 'a)/xrhston',
  'work_unicode': 'ἄχρηστον',
  'lemma_beta': 'a)/xrhstos',
  'lemma_base': 'a)/xrhstos',
  'lemma_unicode': 'ἄχρηστος',
  'lemma_base_unicode': 'ἄχρηστος',
  'stem_beta': 'a)xrhst',
  'stem_unicode': 'ἀχρηστ',
  'morph_codes': ['os_on', 'os_on'],
  'ending_beta': 'on',
  'ending_unicode': 'ον',
  'other_end_tokens': ['masc/fem'],
  'case': 'acc',
  'number': 'sg',
  'raw_unicode': 'ἄχρηστον',
  'POS': 'noun',
  'score': 0},
 {'raw_beta': 'a)/xrhston',
  'work_beta': 'a)/xrhston',
  'work_unicode': 'ἄχρηστον',
  'lemma_beta': 'a)/xrhstos',
  'lemma_base': 'a)/xrhstos',
  'lemma_unicode': 'ἄχρηστος',
  'lemma_base_unicode': 'ἄχρηστος',
  'stem_beta': 'a)xrhst',
  'stem_unicode': 'ἀχρηστ',
  'morph_codes': ['os_on', 'os_on'],
  'ending_beta': 'on',
  'ending_unicode': 'ον',
  'gender': 'neut',
  'other_end_tokens': ['nom/voc/acc'],
  'number': 'sg',
  'raw_unicode': 'ἄχρηστον',
  'PO

Node 18815 | city  |  πόλις |  po/lis
[{'raw_beta': 'po/lis',
  'work_beta': 'po/li_s',
  'work_unicode': 'πόλι—ς',
  'lemma_beta': 'po/lis',
  'lemma_base': 'po/lis',
  'lemma_unicode': 'πόλις',
  'lemma_base_unicode': 'πόλις',
  'stem_beta': 'pol',
  'stem_unicode': 'πολ',
  'gender': 'fem',
  'morph_codes': ['is_ews', 'is_ews'],
  'ending_beta': 'i_s',
  'ending_unicode': 'ι—ς',
  'case': 'acc',
  'number': 'pl',
  'dialects': ['epic', 'doric', 'ionic', 'aeolic'],
  'raw_unicode': 'πόλις',
  'POS': 'noun',
  'score': 0},
 {'raw_beta': 'po/lis',
  'work_beta': 'po/lis',
  'work_unicode': 'πόλις',
  'lemma_beta': 'po/lis',
  'lemma_base': 'po/lis',
  'lemma_unicode': 'πόλις',
  'lemma_base_unicode': 'πόλις',
  'stem_beta': 'pol',
  'stem_unicode': 'πολ',
  'gender': 'fem',
  'morph_codes': ['is_ews', 'is_ews'],
  'ending_beta': 'is',
  'ending_unicode': 'ις',
  'case': 'nom',
  'number': 'sg',
  'raw_unicode': 'πόλις',
  'POS': 'noun',
  'score': 0}]
----------------------------------

Node 48336 | the  |  τοῖς |  toi=s
[{'raw_beta': 'toi=s',
  'work_beta': 'toi=s',
  'work_unicode': 'τοῖς',
  'lemma_beta': 'o(',
  'lemma_base': 'o(',
  'lemma_unicode': 'ὁ',
  'lemma_base_unicode': 'ὁ',
  'stem_beta': 'toi=s',
  'stem_unicode': 'τοῖς',
  'morph_flags': ['indeclform', 'indeclform'],
  'other_end_tokens': ['masc/neut'],
  'case': 'dat',
  'number': 'pl',
  'morph_codes': ['article'],
  'raw_unicode': 'τοῖς',
  'POS': 'art',
  'score': 1}]
----------------------------------------


In [9]:
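# Rebuild the ranking with inline scoring: the TF values are compared with the
# parse values as they are, without the abbreviation maps used by score_parse
ranked_parses = {}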

for w in F.otype.s('word'):
    beta = F.betacode.v(w)
    parses = morph_data.get(beta)
    if not parses:
        continue

    tf_attrs = dict(
        case   = F.case.v(w),
        number = F.number.v(w),
        gender = F.gender.v(w),
        tense  = F.tense.v(w),
        mood   = F.mood.v(w),
        voice  = F.voice.v(w),
        person = F.person.v(w),
        sp     = F.sp.v(w),
    )

    # Compute score for *every* parse and keep it in a tuple (score, parse_dict)
    scored = []
    for p in parses:
        score = 0
        for key in ("case", "number", "gender"):
            if tf_attrs.get(key) and p.get(key) == tf_attrs[key]:
                score += 2
        for key in ("tense", "mood", "voice"):
            if tf_attrs.get(key) and p.get(key) == tf_attrs[key]:
                score += 2
        if tf_attrs.get("person") and p.get("person") == tf_attrs["person"]:
            score += 1
        if tf_attrs.get("sp") and p.get("POS") == tf_attrs["sp"]:
            score += 1
        scored.append((score, p))

    # Sort descending by score (highest first); keep the score inside each dict
    scored.sort(key=lambda t: t[0], reverse=True)
    ranked = []
    for sc, pd in scored:
        pd_with_score = dict(pd)  # shallow copy
        pd_with_score["score"] = sc
        ranked.append(pd_with_score)
    ranked_parses[w] = ranked

print(f"Ranked parses prepared for {len(ranked_parses):,} word nodes.")


Ranked parses prepared for 135,358 word nodes.


In [10]:
rank_json_path = "word2parse_ranked.json"
with open(rank_json_path, "w", encoding="utf-8") as f:
    json.dump(ranked_parses, f, ensure_ascii=False, indent=2)

print(f"JSON file saved to {rank_json_path}  "
      f"({sum(len(v) for v in ranked_parses.values()):,} total parse rows)")


JSON file saved to word2parse_ranked.json  (243,302 total parse rows)


In [11]:
import random, pprint

print("Random sample of ranked parses:")
for n in random.sample(list(ranked_parses), 5):
    print("="*70)
    print(f"Node {n}:  {F.text.v(n)}  {F.trans.v(n)}  (betacode: {F.betacode.v(n)})")
    for rank, pd in enumerate(ranked_parses[n][:2], 1):
        print(f"  #{rank}  score={pd['score']}")
        pprint.pp({k:v for k,v in pd.items() if k!='score'})


Random sample of ranked parses:
Node 51262:  ὅτι  that  (betacode: o(/ti)
  #1  score=1
{'raw_beta': 'o(/ti',
 'work_beta': 'o(/ti',
 'work_unicode': 'ὅτι',
 'lemma_beta': 'o(/ti2',
 'lemma_base': 'o(/ti',
 'homonym': 2,
 'lemma_unicode': 'ὅτι2',
 'lemma_base_unicode': 'ὅτι',
 'stem_beta': 'o(/ti',
 'stem_unicode': 'ὅτι',
 'morph_flags': ['indeclform', 'indeclform'],
 'morph_codes': ['conj'],
 'raw_unicode': 'ὅτι',
 'POS': 'conj'}
  #2  score=0
{'raw_beta': 'o(/ti',
 'work_beta': 'o(/ti',
 'work_unicode': 'ὅτι',
 'lemma_beta': 'o(/stis',
 'lemma_base': 'o(/stis',
 'lemma_unicode': 'ὅστις',
 'lemma_base_unicode': 'ὅστις',
 'stem_beta': 'o(/ti',
 'stem_unicode': 'ὅτι',
 'morph_flags': ['indeclform', 'indeclform'],
 'gender': 'neut',
 'other_end_tokens': ['nom/acc'],
 'number': 'sg',
 'morph_codes': ['relative'],
 'raw_unicode': 'ὅτι',
 'POS': 'noun'}
Node 66409:  Θεὸς  God  (betacode: *qeo\s)
  #1  score=0
{'raw_beta': '*qeo\\s',
 'work_beta': 'qeo/s',
 'work_unicode': 'θεός',
 'lemma_be

# 4 - Acknowledgements <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook used the following sources for the analysis and implementation:

- [Morpheus Morphological Analyzer (Perseus Project)](https://github.com/perseids-tools/morpheus/)
- [Greek Beta Code standard](https://stephanus.tlg.uci.edu/encoding/BCM.pdf)
- [beta-code-py](https://github.com/perseids-tools/beta-code-py)

The [Anaconda Assistant](https://www.anaconda.com/capability/anaconda-assistant) (using [OpenAI](https://openai.com/) as backend) was used to debug and/or optimize the code in this Jupyter Notebook.

# 5 - Required libraries <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

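The scripts in this notebook require Python 3.8+ and the following libraries to be installed in the environment:

``` python
    text-fabric
    tqdm
```

The modules `json`, `pathlib`, `random` and `pprint` are part of the Python standard library and need no installation.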

You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.
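
A quick way to see which libraries are missing before running the notebook is to ask `importlib` whether each module can be found. The helper name `missing_modules` is made up for this sketch, and the module names passed in are an assumption based on the import statements used in this notebook (`tf` is the import name of Text-Fabric).

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Module names assumed from the import statements in this notebook
print(missing_modules(["tf", "tqdm", "json", "pathlib"]))
```

An empty list means every module can be imported; any name that is printed still has to be installed.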

# 6 - Notebook version details<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.2</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>30 April 2025</td>
    </tr>
  </table>
</div>