# Creating Morpheus features for the N1904-TF dataset (updated 22 June)

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load app and data</a>
* <a href="#bullet3">3 - Adding features to currently loaded TF dataset</a>
    * <a href="#bullet3x1">3.1 - Prepare metadata</a>
    * <a href="#bullet3x2">3.2 - Examine data structure</a>
    * <a href="#bullet3x3">3.3 - Prepare feature data</a>
    * <a href="#bullet3x4">3.4 - Save the feature to files</a>
    * <a href="#bullet3x5">3.5 - Reload Text-Fabric with the new feature</a>
    * <a href="#bullet3x6">3.6 - Check if the new feature is loaded</a>    
    * <a href="#bullet3x7">3.7 - Move the newly created feature to final location</a>
* <a href="#bullet4">4 - Attribution and footnotes</a>
* <a href="#bullet5">5 - Required libraries</a>
* <a href="#bullet6">6 - Notebook version</a>


#  1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This Jupyter Notebook documents and demonstrates the complete workflow (data loading, transformation, and export) used to enrich the [N1904 TF dataset](https://github.com/CenterBLC/N1904) (version 1.0.0) with a set of Morpheus-related features.

At a high level, this notebook performs the following steps:

   - Load the N1904 corpus together with the additional features, including `betacode`.
   - Import a dedicated Python package (a REST API client) that lets the notebook query the Morpheus service in real time.
   - Define a template function that keeps the metadata of every TF feature consistent.
   - Define the metadata and the data dictionaries for all TF features programmatically.
   - Run the main loop, which visits every word node in the N1904 corpus and:
       - looks up the word in Morpheus (*real-time API call*),
       - updates the per-word metadata features,
       - populates a set of summary features for a quick overview.
   - After all word nodes are processed, link the data dictionaries to their metadata.
   - Write the whole set to disk in a single action (*TF.save*).
   - Load and verify the new features.
   - Copy the new feature files to a permanent GitHub location.
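
The main loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `fake_lookup` and the `FAKE_MORPHEUS` table are hypothetical stand-ins for the real Morpheus call, the node numbers and word forms are made up, and only two features are collected.

```python
from collections import defaultdict

# Hypothetical stand-in for the Morpheus service; the notebook itself
# queries a REST API through morphkit instead of this dictionary.
FAKE_MORPHEUS = {
    "kai/": [{"lem_full_bc": "kai/", "morph": "CONJ"}],
    "lo/gos": [{"lem_full_bc": "lo/gos", "morph": "N-NSM"}],
}

def fake_lookup(betacode):
    """Return a result shaped like {'blocks': n, 'analyses': [...]}."""
    analyses = FAKE_MORPHEUS.get(betacode, [])
    return {"blocks": len(analyses), "analyses": analyses}

# Toy corpus: node number -> betacode form (illustrative values only)
words = {1: "kai/", 2: "*lo/gos", 3: "xyz"}

num_blocks = {}              # per-word feature: node -> value
morph = defaultdict(dict)    # per-block feature: block -> node -> value
skipped = 0

for node, betacode in words.items():
    result = fake_lookup(betacode)
    if not result["blocks"]:
        # Retry without the Betacode capitalisation marker '*'
        result = fake_lookup(betacode.replace("*", ""))
    if not result["blocks"]:
        skipped += 1         # nothing to store for this word
        continue
    num_blocks[node] = result["blocks"]
    for block, data in enumerate(result["analyses"], start=1):
        morph[block][node] = data["morph"]

print(num_blocks)
print(dict(morph))
print(skipped)
```

Keeping one dictionary per feature, keyed by node number, matches the shape Text-Fabric expects for node features, which is why the notebook collects everything first and writes it out afterwards.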

# 2 -  Load TF with N1904addons <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

Since the new features should act as an extension to the N1904-TF dataset, we first need to load this dataset, together with the Text-Fabric Python code.

In [1]:
# Load the autoreload extension to automatically reload modules before executing code
%load_ext autoreload
%autoreload 2

In [2]:
# Loading the Text-Fabric code
from tf.fabric import Fabric
from tf.app import use

In [3]:
# Load the N1904-TF app and data with the additional features
A = use ("CenterBLC/N1904", mod="tonyjurg/N1904addons/tf", silence="terse", hoist=globals())

**Locating corpus resources ...**

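| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |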


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

In [4]:
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
A.dh(A.getCss())

# 3 - Load the morphkit library <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

To streamline communication with the Morpheus service running in my Docker container, I have developed a Python package called morphkit, which is hosted on GitHub (https://github.com/tonyjurg/morphkit).

In [5]:
import sys
sys.path.insert(0, "../../morphkit")    # relative to notebook dir
import morphkit

morphkit loaded


# 3 - Adding features to currently loaded TF dataset<a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

In order to add a new set of features to a Text-Fabric dataset, we first prepare the metadata. Next, we will generate the feature data itself, and finally link both together and to the existing dataset.

## 3.1 - Prepare metadata<a class="anchor" id="bullet3x1"></a>

The following code block defines a function that generates metadata to be included at the top of every new Text-Fabric feature file (.tf).

In [6]:
# Common Text-Fabric metadata template function
def createMetadata(description, valueType):
    return {
        'author': 'Morpheus (perseids-tools)',
        'convertedBy': 'Tony Jurg',
        'website': 'https://github.com/tonyjurg/N1904addons',
        'description': description,
        'coreData': 'Nestle 1904 Text-Fabric (centerBLC)',
        'coreDataUrl': 'https://github.com/CenterBLC/N1904',
        'provenance': 'jupyter Notebook (https://github.com/tonyjurg/create_TF_feature_betacode)',
        'version': '1.0.0',   # This is the version of the N1904-TF dataset against which this feature is built!
        'license': 'Creative Commons Attribution 4.0 International (CC BY 4.0)',
        'licenseUrl': 'https://github.com/tonyjurg/N1904addons/blob/main/LICENSE.md',
        'valueType': valueType
    }
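
To make the role of this dictionary concrete, the sketch below renders a metadata dictionary in the way a `.tf` feature file starts: a `@node` line followed by one `@key=value` line per metadata entry. The helper `render_tf_header` is hypothetical and exists only for illustration; Text-Fabric's own `TF.save` does the real writing and may order the keys differently or add fields of its own.

```python
def render_tf_header(metadata, kind="node"):
    """Render a metadata dict as .tf-style header lines (illustrative only)."""
    lines = [f"@{kind}"]
    lines += [f"@{key}={value}" for key, value in metadata.items()]
    return "\n".join(lines)

# A trimmed-down metadata dictionary with three of the keys used above
example_metadata = {
    "description": "Morpheus :raw in betacode",
    "valueType": "str",
    "version": "1.0.0",
}

header = render_tf_header(example_metadata)
print(header)
```

Because every feature gets its metadata from the same template function, all generated files share the same provenance, license and version fields and differ only in `description` and `valueType`.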

In [7]:
# Create metadata for Morpheus Metadata (mm_xxx) features using createMetadata function
mm_raw_bc_Metadata      = createMetadata('Morpheus :raw in betacode',           'str')
mm_raw_uc_Metadata      = createMetadata('Morpheus :raw in unicode',            'str')
mm_num_blocks_Metadata  = createMetadata('Morpheus total number of returned analytic blocks',  'int')
mm_num_lemmas_Metadata  = createMetadata('Morpheus total number of returned different lemmas','int')
mm_num_morphs_Metadata  = createMetadata('Morpheus total number of returned different morph tags','int')
mm_max_sim_Metadata     = createMetadata('Morpheus maximum similarity with N1904-TF morph tag for any lemma','int')
mm_gram_dif_Metadata    = createMetadata('grammatical difference against N1904-TF (field)','str')

In [8]:
# Create metadata for Morpheus Summary (ms_xxxx) features using createMetadata function 
# The result will be pushed into dictionary ms_metadata

number_of_ms_sets=11  # upper bound for range(); the actual number of feature sets is one lower (10)

ms_features = [
    # field_name      description                                             datatype
    
    # Lemma related:
    ('lem_full_bc',  'full lemma (incl homonym and pl suffixes) in betacode', 'str'),
    ('lem_full_uc',  'full lemma (incl homonym and pl suffixes) in unicode',  'str'),
    ('lem_base_bc',  'base lemma in betacode',                                'str'),
    ('lem_base_uc',  'base lemma in unicode',                                 'str'),
    ('lem_homonym',  'lemma homonym indicator',                               'str'),
    ('lem_pl_suff',  'lemma pl suffix indicator',                             'str'),

    # prvb related:
    ('prvb_bc',      'prvb in betacode',                                      'str'),
    ('prvb_uc',      'prvb in unicode',                                       'str'),
    
    # Morph related:
    ('morph',        'list of morph tags',                                    'str'),
    ('morph_sim',    'list of morph similarities to N1904-TF morph',          'str'),
    ('num_morphs',   'Number of morph tags in this summary',                  'int'),
    ('gram_dif',     'grammatical difference against N1904-TF (field)',       'str'),
    
    # Block accounting related:
    ('num_blocks',   'number of analytic blocks for this lemma',              'int'),
    ('block_nums',   'list of function reference numbers for this lemma',     'str'), 
]

# Build *all* Morpheus Summary feature metadata
ms_metadata = {
    f"ms{blk}_{field_name}": 
        createMetadata(f"Morpheus summary group {blk}: {desc}", datatype)
    for blk in range(1, number_of_ms_sets)
    for field_name, desc, datatype in ms_features
}

In [9]:
# inspect what we have now...
import pprint
pprint.pprint(list(ms_metadata.keys())[:15])

['ms1_lem_full_bc',
 'ms1_lem_full_uc',
 'ms1_lem_base_bc',
 'ms1_lem_base_uc',
 'ms1_lem_homonym',
 'ms1_lem_pl_suff',
 'ms1_prvb_bc',
 'ms1_prvb_uc',
 'ms1_morph',
 'ms1_morph_sim',
 'ms1_num_morphs',
 'ms1_gram_dif',
 'ms1_num_blocks',
 'ms1_block_nums',
 'ms2_lem_full_bc']


In [10]:
print(ms_metadata.get('ms1_num_blocks').get('valueType'))

int
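
The naming scheme and the off-by-one in `range()` are easy to check on a smaller scale. The sketch below uses made-up values (4 instead of 11, and two fields instead of fourteen) and a simplified dictionary that maps each generated name straight to its value type.

```python
number_of_sets = 4    # illustrative value, not the notebook's setting
fields = [("morph", "str"), ("num_blocks", "int")]

names = {
    f"ms{blk}_{field}": value_type
    for blk in range(1, number_of_sets)    # yields 1, 2, 3
    for field, value_type in fields
}

print(len(names))
print(sorted(names))
```

With the notebook's own values (`number_of_ms_sets=11` and 14 entries in `ms_features`), the same arithmetic gives 10 × 14 = 140 summary feature names, running from `ms1_lem_full_bc` to `ms10_block_nums`.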


In [11]:
# Create metadata for Morpheus Details (md_xxxx) features using createMetadata function
number_of_md_sets=25  # upper bound for range(); the actual number of Morpheus Details sets is one lower (24)

# First, define the features for every Morpheus analytic block:
md_features = [
    # row ':workw'
    ('workw_bc',      ':workw line - workw in betacode'),
    ('workw_uc',      ':workw line - workw in unicode'),

    # row ':lem'
    ('lem_full_bc',   ':lem line - raw lemma (incl. homonym or pl-suffix) in betacode'),
    ('lem_full_uc',   ':lem line - raw lemma (incl. homonym or pl-suffix) in unicode'),
    ('lem_base_bc',   ':lem line - base lemma in betacode'),
    ('lem_base_uc',   ':lem line - base lemma in unicode'), 
    ('lem_homonym',   ':lem line - lemma homonym'),
    ('lem_pl_suff',   ':lem line - lemma pl suffix'),

    # row ':prvb'
    ('prvb_bc',       ':prvb line - prepositions space-separated list in betacode'),
    ('prvb_uc',       ':prvb line - prepositions slash-separated list in unicode'),
    
    # row ':aug1'
    ('aug1_bc',       ':aug1 line - augment in betacode'),
    ('aug1_uc',       ':aug1 line - augment in unicode'),

    # row ':stem'
    ('stem_bc',       ':stem line - stem in betacode'),
    ('stem_uc',       ':stem line - stem in unicode'),
    ('stem_codes',    ':stem line - listed morph codes'),
    ('stem_flags',    ':stem line - listed morph flags'),

    # row ':suff' : EMPTY for GNT

    # row ':end'
    ('end_bc',        ':end line - ending in betacode'),
    ('end_uc',        ':end line - ending in unicode'),
    ('end_codes',     ':end line - listed morph codes'),
    ('end_flags',     ':end line - listed morph flags'),
    
    # detailed base features (partly deduced)
    ('gender',        'grammatical gender'),
    ('number',        'grammatical number'),
    ('case' ,         'grammatical case' ),
    ('voice',         'verb voice'),
    ('mood',          'verb mood'),
    ('person',        'grammatical person'),
    ('tense',         'verb tense'),
    ('sec_tense',     'second form of the tense'),    
    ('degree',        'degree (comparative/superlative)'),
    ('typems',        'type of pronoun'),

    # aggregated from various rows
    ('dialects',      'listed dialects'),

    # Deduced (summary for single block)
    ('sp',            'part of speech'),
    ('morph',         'morphological tag following SP'),
    ('gram_dif',      'grammatical difference against N1904-TF (field)')
]

# Build *all* Morpheus Details feature metadata 
md_metadata = {
    f"md{blk}_{field_name}": 
        createMetadata(f"Morpheus analytic block {blk} {desc}", 'str')
    for blk in range(1, number_of_md_sets)
    for field_name, desc in md_features
}

In [12]:
# inspect what we have now...
import pprint
pprint.pprint(list(md_metadata.keys())[:35])

['md1_workw_bc',
 'md1_workw_uc',
 'md1_lem_full_bc',
 'md1_lem_full_uc',
 'md1_lem_base_bc',
 'md1_lem_base_uc',
 'md1_lem_homonym',
 'md1_lem_pl_suff',
 'md1_prvb_bc',
 'md1_prvb_uc',
 'md1_aug1_bc',
 'md1_aug1_uc',
 'md1_stem_bc',
 'md1_stem_uc',
 'md1_stem_codes',
 'md1_stem_flags',
 'md1_end_bc',
 'md1_end_uc',
 'md1_end_codes',
 'md1_end_flags',
 'md1_gender',
 'md1_number',
 'md1_case',
 'md1_voice',
 'md1_mood',
 'md1_person',
 'md1_tense',
 'md1_sec_tense',
 'md1_degree',
 'md1_typems',
 'md1_dialects',
 'md1_sp',
 'md1_morph',
 'md1_gram_dif',
 'md2_workw_bc']


## 3.2 - Examine data structure <a class="anchor" id="bullet3x2"></a>

First, dump one example analysis to see what the Morpheus lookup returns:

In [15]:
import pprint
api_endpoint = "10.0.1.156:1315"  # Morpheus service API IP&port
resultdata=morphkit.analyse_word_with_morpheus('kai\\', api_endpoint,debug=False)  #u)mmi/^n
pprint.pprint(resultdata)

{'analyses': [{'end_codes': ['conj'],
               'end_flags': ['indeclform'],
               'lem_base_bc': 'kai/',
               'lem_base_uc': 'καί',
               'lem_full_bc': 'kai/',
               'lem_full_uc': 'καί',
               'morph': 'CONJ',
               'pos': 'conjunction',
               'raw_bc': 'kai\\',
               'raw_uc': 'καὶ',
               'stem_bc': 'kai/',
               'stem_flags': ['indeclform'],
               'stem_uc': 'καί',
               'workw_bc': 'kai/',
               'workw_uc': 'καί'}],
 'blocks': 1,
 'raw_bc': 'kai\\',
 'raw_uc': 'καὶ'}
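
The structure shown above is a plain dictionary with a list of per-block dictionaries under `analyses`. The sketch below walks that structure using a trimmed copy of the example output; no Morpheus service is needed. Keys that are absent from a block (here `gender`) are read with `.get()` and simply yield `None`.

```python
# Trimmed copy of the example result shown above
resultdata = {
    "analyses": [
        {
            "lem_base_uc": "καί",
            "morph": "CONJ",
            "pos": "conjunction",
            "end_codes": ["conj"],
        }
    ],
    "blocks": 1,
    "raw_bc": "kai\\",
    "raw_uc": "καὶ",
}

# One tuple per analytic block: (block number, lemma, morph tag, gender)
summary = [
    (block, data.get("lem_base_uc"), data.get("morph"), data.get("gender"))
    for block, data in enumerate(resultdata["analyses"], start=1)
]
print(resultdata["blocks"], summary)
```

Numbering the blocks from 1 mirrors the feature names, where `md1_...` holds the values of the first analytic block.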


## 3.3 - Prepare feature data<a class="anchor" id="bullet3x3"></a>

First, we need to define all the dictionaries that will store the TF feature data generated from the Morpheus output.

In [17]:
from collections import defaultdict
from typing import DefaultDict, Dict

# ----------- Morpheus Meta Features ----------
# These features provide information on the
# full analysis performed on a single word

# Generic (str):
mm_raw_bc_Dict         = {}
mm_raw_uc_Dict         = {}

# Calculated values (int):
mm_num_blocks_Dict     = {}
mm_num_lemmas_Dict     = {}
mm_num_morphs_Dict     = {}
mm_max_sim_Dict        = {}
# Calculated values (str):
mm_gram_dif_Dict       = {}

# --------- Morpheus Summary Features ---------
# These features provide a summary view on the
# returned analytic blocks per individual lemma

# Lemma related (str):
ms_lem_full_bc_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
ms_lem_full_uc_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
ms_lem_base_bc_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
ms_lem_base_uc_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
ms_lem_homonym_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
ms_lem_pl_suff_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# prvb related (str):
ms_prvb_bc_Dict     : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
ms_prvb_uc_Dict     : DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# Morph related:
ms_morph_Dict       : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
ms_morph_sim_Dict   : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
ms_num_morphs_Dict  : DefaultDict[int, Dict[int, int]] = defaultdict(dict)
ms_gram_dif_Dict    : DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# Block accounting related:
ms_num_blocks_Dict  : DefaultDict[int, Dict[int, int]] = defaultdict(dict)
ms_block_nums_Dict  : DefaultDict[int, Dict[int, str]] = defaultdict(dict)  # pseudo-list -> slash-separated str

# --------- Morpheus Details Features ---------
# These features provide a detailed view of each
# separate analytic block received

# row ':workw'
md_workw_bc_Dict    : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_workw_uc_Dict    : DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# row ':lem'
md_lem_full_bc_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_lem_full_uc_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_lem_base_bc_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_lem_base_uc_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_lem_homonym_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_lem_pl_suff_Dict : DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# row ':prvb'
md_prvb_bc_Dict     : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_prvb_uc_Dict     : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
    
# row ':aug1'
md_aug1_bc_Dict     : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_aug1_uc_Dict     : DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# row ':stem'
md_stem_bc_Dict     : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_stem_uc_Dict     : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_stem_codes_Dict  : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_stem_flags_Dict  : DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# row ':suff' : EMPTY for GNT

# row ':end'
md_end_bc_Dict      : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_end_uc_Dict      : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_end_codes_Dict   : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_end_flags_Dict   : DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# 'detailed type'
md_gender_Dict      : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_number_Dict      : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_case_Dict        : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_person_Dict      : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_voice_Dict       : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_mood_Dict        : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_tense_Dict       : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_sec_tense_Dict   : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_dialects_Dict    : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_degree_Dict      : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_typems_Dict      : DefaultDict[int, Dict[int, str]] = defaultdict(dict) # type of pronoun

# Calculated properties
md_sp_Dict          : DefaultDict[int, Dict[int, str]] = defaultdict(dict)
md_morph_Dict       : DefaultDict[int, Dict[int, str]] = defaultdict(dict) # pseudo-list -> slash-separated str
md_gram_dif_Dict    : DefaultDict[int, Dict[int, str]] = defaultdict(dict) # field as string
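
The two-level dictionaries deserve a short illustration. In the sketch below the outer key is the block (or lemma group) number and the inner key is the word node, so a value can be stored with `d[block][node] = value` without creating the inner dictionary first. The node numbers and tags are made up, and the final flattening step is an assumption about how such a structure could be turned into one node feature per block; it is not taken from the notebook.

```python
from collections import defaultdict
from typing import DefaultDict, Dict

# Outer key: analytic block number; inner key: word node number
md_morph: DefaultDict[int, Dict[int, str]] = defaultdict(dict)

# defaultdict creates the inner dict on first access
md_morph[1][100] = "N-NSM"
md_morph[2][100] = "N-ASM"
md_morph[1][101] = "CONJ"

# Hypothetical flattening: one node -> value mapping per feature name
node_features = {f"md{block}_morph": values for block, values in md_morph.items()}
print(node_features)
```

A word with fewer analytic blocks simply has no entry in the higher-numbered inner dictionaries, so no placeholder values are needed.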

With all dictionaries initialized, we can now iterate over every word node in the dataset and populate it with Morpheus-derived features.
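
Before the loop itself, it may help to see how the grammatical difference strings behave. The sketch below is a simplified, standalone restatement of the two comparison helpers defined in the next cell; the names `diff_map` and `or_maps` and all sample values are illustrative only.

```python
LETTERS = "lpcngtmvd"  # lemma, person, case, number, gender, tense, mood, voice, degree

def diff_map(reference, candidate):
    """Return a 9-character string: a letter marks a mismatch, '.' a match."""
    out = []
    for i, letter in enumerate(LETTERS):
        ref, cand = reference.get(i), candidate.get(i)
        if cand is None:
            cand_set = set()
        elif isinstance(cand, list):
            cand_set = set(cand)
        else:
            cand_set = {cand}
        # No reference value: match only if the candidate has none either
        match = (not cand_set) if ref is None else (cand_set == {ref})
        out.append("." if match else letter)
    return "".join(out)

def or_maps(a, b):
    """Combine two difference strings; a mismatch in either one is kept."""
    return "".join(x if x != "." else y for x, y in zip(a, b))

reference = {0: "λόγος", 2: "nominative", 3: "singular", 4: "masculine"}
block_1 = {0: "λόγος", 2: "nominative", 3: "singular", 4: "masculine"}
block_2 = {0: "λόγος", 2: "accusative", 3: "singular", 4: ["masculine", "neuter"]}

d1 = diff_map(reference, block_1)
d2 = diff_map(reference, block_2)
print(d1, d2, or_maps(d1, d2))
```

The second block differs in case and offers more than one gender, so it is flagged at the `c` and `g` positions; combining both blocks keeps those flags.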

In [19]:
import morphkit
from typing import Dict, Any
from itertools import zip_longest
import unicodedata
from tqdm.std import tqdm
api_endpoint = "10.0.1.156:1315"  # Morpheus API service IP&port
import logging
logging.basicConfig(level=logging.WARNING)
debug=False

def norm(s): # no longer used!
    return unicodedata.normalize('NFC', s) if isinstance(s, str) else s


def diffGramMap(n1904Map: dict, morphMap: dict) -> str:
    """
    Compare two grammatical maps (stored as dicts) and return a 9-character string where:
      - each character corresponds to a position (0–8) in a field 'lpcngtmvd': lemma, person,
        case, number, gender, tense, mood, voice, degree.
      - a letter from mismatchLetters indicates a mismatch where Morpheus has other options than N1904,
      - a dot '.' indicates exact match or both are None.
      - if val2 is an empty list and val1 is None it will be treated as a match.
    """
    mismatchLetters = "lpcngtmvd"
    result = []

    for i in range(9):
        val1 = n1904Map.get(i)
        val2 = morphMap.get(i)
        # normalize morphMap value to a set
        if val2 is None:
            morphVals = set()
        elif isinstance(val2, list):
            morphVals = set(val2)
        else:
            morphVals = {val2}
        # handle N1904 value
        if val1 is None:
            is_match = not morphVals  # both None/empty
        else:
            is_match = morphVals == {val1}  # must match exactly

        if not is_match:
            result.append(mismatchLetters[i])
        else:
            result.append('.')  # match

    return ''.join(result)


def gramMapOr(map1: str, map2: str) -> str:
    """
    Perform logical OR on two 9-character strings.  Each character is either a feature letter or '.'.
    Returns a new string with combined mismatch indicators.
    """
    if len(map1) != 9 or len(map2) != 9:
        raise ValueError("Both input strings must be exactly 9 characters long")

    result = []

    for a, b in zip(map1, map2):
        if a != '.':               # first indicates mismatch
            result.append(a)
        elif b != '.':             # else if second indicates mismatch
            result.append(b)
        else:
            result.append('.')

    return ''.join(result)



def join_slash(feature, mapping=None, sep="/"):
    """
    Map each item in `feature` through `mapping` (if provided) and join with `sep`.
    - If feature is a list or tuple, map each element (or leave as-is) and join.
    - If feature is a single value, map it (or leave as-is) and return.
    - If feature is None or empty, return None.
    """
    # Intentional: feature=[], feature=0, or feature="" all yield None.
    if not feature:
        return None
        
    # Helper to apply mapping if present, report if not
    def _map(item):
        if mapping:
            if item in mapping:
                return mapping[item]
            else:
                logging.warning(f"Unmapped feature code: {item}")
                return item
        return item

    if isinstance(feature, (list, tuple)):
        ret_val= sep.join(_map(item) for item in feature)
        #print (f'2: {ret_val}') 
        return ret_val
    return _map(feature)


word_node_list=F.otype.s('word')

skipped_words=0
total_words   = len(word_node_list)
retries=0
counter=0

# Define the features in order (lpcngtmvd) for the gram_flip_map. It will be built per word node using dictionary comprehension
N1904_features = ["lemma", "person", "case",  "number", "gender", "tense", "mood", "voice", "degree"]
# There is a slight difference in naming. It will be built per Morpheus analytic block using dictionary comprehension
Morph_features = ["lem_base_uc", "person", "case",  "number", "gender", "tense", "mood", "voice", "degree"]

# Some mapping is required to translate Morpheus values into the values used by the corresponding N1904-TF features
PERSON_MAP  = {"1st": "p1", "2nd": "p2", "3rd": "p3"}
CASE_MAP    = {"nom": "nominative", "acc": "accusative", "gen": "genitive", "dat": "dative", "abl": "ablative", "voc": "vocative" }
NUMBER_MAP  = {"sg": "singular", "pl": "plural", "dual": "dual"}
GENDER_MAP  = {"masc": "masculine", "fem": "feminine", "neut": "neuter"}
TENSE_MAP   = {"pres": "present", "imperf": "imperfect", "fut": "future", "aor": "aorist", "perf": "perfect", "plup": "pluperfect" }
MOOD_MAP    = {"ind": "indicative", "subj": "subjunctive", "opt": "optative", "imperat": "imperative", "inf": "infinitive", "part": "participle"}
VOICE_MAP   = {"act": "active", "middle": "middle", "pass": "passive", "mp": "middlePassive"  }
DEGREE_MAP  = {"comp": "comparative", "sup": "superlative"}
# mappings by feature-index
FEATURE_MAPS = { 1: PERSON_MAP, 2: CASE_MAP, 3: NUMBER_MAP, 4: GENDER_MAP, 5: TENSE_MAP, 6: MOOD_MAP, 7: VOICE_MAP, 8: DEGREE_MAP}
# Part of Speech mapping
POS_MAP = { "conjunction"          : "conj",   "verb"                    : "verb",    "adverb"                  : "advb",
            "adjective"            : "adjv",   "preposition"             : "prep",    "numeral"                 : "num",
            "article"              : "art",    "interjection"            : "intj",    "numeral indeclinable"    : "num",
            "demonstrative pronoun": "pron",   "relative pronoun"        : "pron",    "personal pronoun"        : "pron",
            "indefinite pronoun"   : "pron",   "interrogative pronoun"   : "pron",
            "noun"                 : "noun",   "proper noun indeclinable": "noun",    "noun other indeclinable" : "noun",
            "particle"             : "part",   # TODO: confirm what particles map to in N1904-TF
          }

# Looping over word nodes and populating node feature dictionaries
for word_node in tqdm(word_node_list, desc="Processing word nodes"):
    if debug:print('\n')
    # ------------------ per single word ------------------
    
    counter += 1

    # Fetch the “raw” lemma and morph for reference
    N1904_lemma = F.lemma.v(word_node)
    N1904_morph = F.morph.v(word_node)
    
    # reset cumulative grammar maps
    ms_gram_dif = '.........'  # 9 dots ; the map per lemma
    ma_gram_dif = '.........'  # the map for all analytic blocks
    # Use dictionary comprehension to populate gram_flip_map (lpcngtmvd) for the base N1904-TF
    N1904_gram_map = {i: getattr(F, feat).v(word_node) for i, feat in enumerate(N1904_features)}
  

    # First pass: ask Morpheus for analysis of the Betacode string
    raw_bc = F.betacode.v(word_node)
    full_analysis = morphkit.analyse_word_with_morpheus(raw_bc, api_endpoint)

    # If no blocks were returned, try again after stripping out any '*' in the Betacode 
    if not full_analysis["blocks"]: 
        cleaned_bc = raw_bc.replace('*', '')            # remove all '*' characters (i.e., the Betacode capitalisation marker)
        if cleaned_bc != raw_bc:
            retries+=1
            #print(f"Retry {raw_bc} as {cleaned_bc }")  # e.g., Retry *)asa/f as )asa/f
            full_analysis = morphkit.analyse_word_with_morpheus(cleaned_bc, api_endpoint)
        if not full_analysis["blocks"]:
            if debug:
                print(f"Skipping {word_node} — no analysis returned")
            skipped_words+=1
            continue  # Safely skip to next word

    # Now that we (hopefully) have some blocks, populate the “mm_” dictionaries:
    mm_raw_bc_Dict[word_node]      = full_analysis["raw_bc"]
    mm_raw_uc_Dict[word_node]      = full_analysis["raw_uc"]
    mm_num_blocks_Dict[word_node]  = full_analysis["blocks"]

    # Annotate and sort analyses by reference morph and lemma
    full_analysis = morphkit.annotate_and_sort_analyses(
        full_analysis,
        reference_morph = N1904_morph,
        reference_lemma = N1904_lemma
    )

    prev_lem_full_bc     = ''
    num_lemmas           = 0        # this is a super important variable as it is key to the summary features
    ms_num_blocks        = 1 
    
    seen_morphs          = set()    # 'per lemma' accounting
    ms_num_morphs        = 0      
    
    total_seen_morphs    = set()    # 'per raw word' accounting
    total_unique_morphs  = set()
    mm_num_morphs        = 0
    mm_max_sim           = 0

    # Iterate over now-sorted analysis blocks
    for block, block_data in enumerate(full_analysis['analyses'], start=1):
        if debug: print (f'{word_node=} {block=}')
            
        # ------------------ Per analytic block  ------------------

        # populate the gram_dif map
        Morph_gram_map = {}
        for i, feat in enumerate(Morph_features):
            raw = block_data.get(feat)
            if raw is None:
                Morph_gram_map[i] = None
            else:
                mapper = FEATURE_MAPS.get(i)
                # if it's a list, map each element (we have to be careful here...)
                if isinstance(raw, (list, tuple)):
                    mapped = [mapper.get(r, r) if mapper else r for r in raw]
                    # unwrap a single-element list; keep multi-valued lists as lists
                    Morph_gram_map[i] = mapped[0] if len(mapped)==1 else mapped
                else:
                    Morph_gram_map[i] = mapper.get(raw, raw) if mapper else raw
        # Now store the grammatical difference between N1904-TF and the current block
        md_gram_dif = diffGramMap(N1904_gram_map, Morph_gram_map)
        md_gram_dif_Dict[block][word_node]         = md_gram_dif
        
      
        # New lemma group detection and summary initialization
        
        lem_full_bc = block_data.get('lem_full_bc', None)
        lem_full_uc = block_data.get('lem_full_uc', None)
        
        if lem_full_bc != prev_lem_full_bc:
            # ─── new lemma group ───────────────────────────────
            if debug: print (f'new lemma group {lem_full_bc}')
            prev_lem_full_bc  = lem_full_bc
            num_lemmas        += 1
            ms_num_blocks     = 1  # reset the block counter for this lemma group

            # initial compare
            ms_gram_dif = md_gram_dif
            # increment
            ma_gram_dif = gramMapOr(md_gram_dif, ma_gram_dif)

            # record all lemma related details in the summary dicts
            ms_lem_full_bc_Dict[num_lemmas][word_node] = lem_full_bc   
            ms_lem_full_uc_Dict[num_lemmas][word_node] = lem_full_uc    
            ms_lem_base_bc_Dict[num_lemmas][word_node] = block_data.get('lem_base_bc')
            ms_lem_base_uc_Dict[num_lemmas][word_node] = block_data.get('lem_base_uc')
            ms_lem_homonym_Dict[num_lemmas][word_node] = block_data.get('lem_homonym', None) 
            ms_lem_pl_suff_Dict[num_lemmas][word_node] = block_data.get('lem_pl_suff', None) 
            
            # record the prvb related details in the summary dicts
            ms_prvb_bc_Dict[num_lemmas][word_node]     = join_slash(block_data.get('prvb_bc', None),sep=' ')   # is a list (join with spaces due to betacode!)
            ms_prvb_uc_Dict[num_lemmas][word_node]     = join_slash(block_data.get('prvb_uc', None))           # is a list (join with slashes)

            # reset accumulators for morph and similarity
            ms_morph       = ''
            ms_morph_sim   = ''
            
            seen_morphs.clear()   # 'per lemma' accounting
            ms_num_morphs    = 0

            # record first block number for this lemma group
            ms_block_nums = str(block)
        else:

            # logical OR it
            ms_gram_dif = gramMapOr(md_gram_dif, ms_gram_dif)
            ma_gram_dif = gramMapOr(md_gram_dif, ma_gram_dif)
            
            # same lemma, extend block numbers
            ms_num_blocks += 1
            ms_block_nums += '/' + str(block)

        # ─── Extract raw morph and similarity, then split into parts ──────────────────
        raw_morph = block_data.get('morph', '')             # e.g. "N-GSM/N-GSN"
        raw_sim   = block_data.get('morph_similarity', '')  # e.g. "100/96"

        morph_parts = [m.strip() for m in raw_morph.split('/') if m.strip()]
        sim_parts   = [s.strip() for s in raw_sim.split('/') if s.strip()]

        # Determine the max similarity for this word
        for part in sim_parts:
            if part.isdigit() and int(part) > mm_max_sim:
                mm_max_sim = int(part)

        # Diagnostic: the number of morph tags and similarity values should match
        # (TODO: check whether this test is still relevant)
        if len(morph_parts) != len(sim_parts):
            logging.warning(
                f"Length mismatch: morph_parts={morph_parts} (len={len(morph_parts)}),"
                f" sim_parts={sim_parts} (len={len(sim_parts)}) for word_node={word_node}, block={block}"
            )

        # ─── Build mapping from tag to its best similarity (first occurrence) ──────────────────
        paired = list(zip_longest(morph_parts, sim_parts, fillvalue='0'))
        tag_sim_map: Dict[str, str] = {}
        for tag, sim in paired:
            # ensure sim is numeric, else default to '0'
            tag_sim_map.setdefault(tag, sim if sim.isdigit() else '0')

        # ─── Accumulate unique tags and sim numbers per block  ──────────────────
        for tag in morph_parts:  
            if tag not in seen_morphs:         
                seen_morphs.add(tag)

                sim_value = tag_sim_map.get(tag, '0') 
                if ms_morph:
                    ms_morph     += '/' + tag
                    ms_morph_sim += '/' + sim_value
                else:
                    ms_morph     = tag
                    ms_morph_sim = sim_value
            if tag not in total_seen_morphs:  # this is the 'per raw word' branch
                total_seen_morphs.add(tag)


        # ─── Sort ms_morph and ms_morph_sim together by similarity (descending) ────
        if ms_morph:
            morph_list = ms_morph.split('/')
            sim_list   = ms_morph_sim.split('/')
        
            # Pair them, converting sim to int for sorting
            paired = []
            for tag, sim_str in zip(morph_list, sim_list):
                try:
                    sim_val = int(sim_str)
                except ValueError:
                    sim_val = 0
                paired.append((tag, sim_str, sim_val))
        
            # Sort by the integer similarity (index 2), descending
            paired_sorted = sorted(paired, key=lambda x: x[2], reverse=True)
        
            # Remove duplicate morph‐tags for *this* lemma group
            seen = set()
            unique_pairs = []
            for tag, sim_str, sim_val in paired_sorted:
                if tag not in seen:
                    seen.add(tag)
                    ms_num_morphs += 1   # this is for the 'per lemma' accounting
                    unique_pairs.append((tag, sim_str))
                    # once we've kept this tag, any further occurrences are dropped
        
            # Unzip back into two lists (preserving the chosen string form for sim)
            morph_list_sorted = [tag for tag, sim_str in unique_pairs]
            sim_list_sorted   = [sim_str for tag, sim_str in unique_pairs]
        
            # Re‐join into slash‐delimited strings
            ms_morph     = '/'.join(morph_list_sorted)
            ms_morph_sim = '/'.join(sim_list_sorted)
            
        # Now store the data for the summary features. This happens on every iteration over a block.
        # Because the dictionary is indexed by num_lemmas, the data keeps accumulating until
        # the last block for a given lemma has been processed.
        ms_morph_Dict[num_lemmas][word_node]         = ms_morph
        ms_morph_sim_Dict[num_lemmas][word_node]     = ms_morph_sim
        # ─── Store updated block numbers and count so far ──────────────────
        ms_block_nums_Dict[num_lemmas][word_node]    = ms_block_nums
        ms_num_blocks_Dict[num_lemmas][word_node]    = ms_num_blocks
        ms_num_morphs_Dict[num_lemmas][word_node]    = ms_num_morphs
        ms_gram_dif_Dict[num_lemmas][word_node]      = ms_gram_dif
        if debug: print(f'{word_node}-{block}-{num_lemmas}: {ma_gram_dif=:10}  {ms_gram_dif=:10} {md_gram_dif=:10}  {ms_morph=:20} ')

        # ─── Now fill in all the detailed feature‐dictionaries ──────────────────

        # row ':workw'
        md_workw_bc_Dict[block][word_node]           = block_data.get('workw_bc', None)
        md_workw_uc_Dict[block][word_node]           = block_data.get('workw_uc', None)

        # row ':prvb'
        md_prvb_bc_Dict[block][word_node]            = join_slash(block_data.get('prvb_bc', None),sep=' ')   # is a list (join with spaces)
        md_prvb_uc_Dict[block][word_node]            = join_slash(block_data.get('prvb_uc', None))           # is a list (join with slashes)

        # row ':aug1'
        md_aug1_bc_Dict[block][word_node]            = block_data.get('aug1_bc', None)
        md_aug1_uc_Dict[block][word_node]            = block_data.get('aug1_uc', None)


        # row ':lem'
        md_lem_full_bc_Dict[block][word_node]        = block_data.get('lem_full_bc', None)
        md_lem_full_uc_Dict[block][word_node]        = block_data.get('lem_full_uc', None)
        md_lem_base_bc_Dict[block][word_node]        = block_data.get('lem_base_bc', None)
        md_lem_base_uc_Dict[block][word_node]        = block_data.get('lem_base_uc', None)
        md_lem_homonym_Dict[block][word_node]        = block_data.get('lem_homonym', None)
        md_lem_pl_suff_Dict[block][word_node]        = block_data.get('lem_pl_suff', None) 

        # row ':stem'
        md_stem_bc_Dict[block][word_node]            = block_data.get('stem_bc', None)
        md_stem_uc_Dict[block][word_node]            = block_data.get('stem_uc', None)
        md_stem_codes_Dict[block][word_node]         = join_slash(block_data.get('stem_codes', None))         # is a list
        md_stem_flags_Dict[block][word_node]         = join_slash(block_data.get('stem_flags', None))         # is a list

        # row ':end'
        md_end_bc_Dict[block][word_node]             = block_data.get('end_bc', None)
        md_end_uc_Dict[block][word_node]             = block_data.get('end_uc', None)
        md_end_codes_Dict[block][word_node]          = join_slash(block_data.get('end_codes', None))          # is a list
        md_end_flags_Dict[block][word_node]          = join_slash(block_data.get('end_flags', None))          # is a list

        # 'detailed type' - also translate the Morpheus values to the N1904-TF values
        md_gender_Dict[block][word_node]             = join_slash(block_data.get('gender', None), GENDER_MAP) # can be list 
        md_number_Dict[block][word_node]             = join_slash(block_data.get('number', None), NUMBER_MAP) # can be list
        md_case_Dict[block][word_node]               = join_slash(block_data.get('case', None), CASE_MAP)     # can be list
        #print (f' {md_case_Dict[block][word_node]} {type(md_case_Dict[block][word_node])}')
        md_person_Dict[block][word_node]             = PERSON_MAP.get(block_data.get('person', None))            
        md_voice_Dict[block][word_node]              = VOICE_MAP.get(block_data.get('voice', None))
        md_mood_Dict[block][word_node]               = MOOD_MAP.get(block_data.get('mood', None))
        md_tense_Dict[block][word_node]              = TENSE_MAP.get(block_data.get('tense', None))
        md_sec_tense_Dict[block][word_node]          = block_data.get('sec_tense', None)
        md_degree_Dict[block][word_node]             = DEGREE_MAP.get(block_data.get('degree', None))
        md_dialects_Dict[block][word_node]           = join_slash(block_data.get('dialects', None))             # can be list 
        md_typems_Dict[block][word_node]             = block_data.get('pron_type', None)
        
        # Calculated properties    
        md_sp_Dict[block][word_node]                 = POS_MAP.get(block_data.get('pos', None))
        md_morph_Dict[block][word_node]              = block_data.get('morph', None)
        

    # ------------------ General for the analysis of a single word ------------------
    # Remove duplicate morph‐tags for this raw word (taking all lemmas together!)     
    for tag in total_seen_morphs:
        if tag not in total_unique_morphs:
            total_unique_morphs.add(tag)
            mm_num_morphs += 1   # this is for the 'per raw word' accounting
    
    # Record total lemma and morph count and similarity for this word_node
    mm_num_lemmas_Dict[word_node]                    = num_lemmas
    mm_num_morphs_Dict[word_node]                    = mm_num_morphs
    mm_max_sim_Dict[word_node]                       = mm_max_sim
    mm_gram_dif_Dict[word_node]                      = ma_gram_dif
   
    #if counter==500: break   # for testing with a small batch

pct = (skipped_words / counter * 100)
print(f"Analysis finished for {counter} words. {skipped_words} words did not provide Morpheus analytic blocks ({pct:.1f}%)")
print(f"{retries} words were retried after removing capitals")

Processing word nodes:  32%|███▏      | 43581/137779 [14:03<28:32, 55.01it/s]

{'raw_bc': 'dei=n', 'raw_uc': 'δεῖν', 'workw_bc': 'dei=n', 'workw_uc': 'δεῖν', 'lem_full_bc': 'dei=', 'lem_full_uc': 'δεῖ', 'lem_base_bc': 'dei=', 'lem_base_uc': 'δεῖ', 'stem_bc': 'dei=n', 'stem_uc': 'δεῖν', 'stem_flags': ['impersonal'], 'tense': 'pres', 'mood': 'part', 'gender': 'neut', 'case': ['nom', 'voc', 'acc'], 'number': 'sg', 'end_flags': ['impersonal'], 'end_codes': ['ew_pr'], 'pos': 'verb'}


Processing word nodes: 100%|██████████| 137779/137779 [44:17<00:00, 51.84it/s]

Analysis finished for 137779 words. 2124 words did not provide Morpheus analytic blocks (1.5%)
554 words were retried after removing capitals
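The per-lemma merging of morph tags and similarity values performed inside the loop above can be looked at in isolation. The following is a minimal sketch, not the notebook's actual code: the function name `merge_tags` and the sample tags are hypothetical, and it assumes that each block delivers two slash-delimited strings (tags and similarities), as `raw_morph` and `raw_sim` do above.

```python
from typing import Dict, List, Tuple


def merge_tags(blocks: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Merge slash-delimited (morph, sim) strings coming from several blocks.

    Returns two slash-delimited strings: the unique tags sorted by
    similarity (descending) and the matching similarity values.
    """
    best: Dict[str, str] = {}   # tag -> similarity of its first occurrence
    for raw_morph, raw_sim in blocks:
        tags = [t.strip() for t in raw_morph.split('/') if t.strip()]
        sims = [s.strip() for s in raw_sim.split('/') if s.strip()]
        for i, tag in enumerate(tags):
            # A missing or non-numeric similarity falls back to '0'
            sim = sims[i] if i < len(sims) and sims[i].isdigit() else '0'
            best.setdefault(tag, sim)   # keep the first value seen for a tag
    # sorted() is stable, so tags with equal similarity keep first-seen order
    ordered = sorted(best.items(), key=lambda kv: int(kv[1]), reverse=True)
    return '/'.join(t for t, _ in ordered), '/'.join(s for _, s in ordered)


# Hypothetical input: two blocks belonging to the same lemma
morphs, sims = merge_tags([("N-NSM/N-VSM", "60/100"), ("N-NSM/A-NSM", "80")])
print(morphs, sims)
```

As in the loop above, a tag that turns up in a later block keeps the similarity of its first occurrence, and a tag without a matching similarity value is stored with `0`.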





## 3.4 - Link metadata to the featuredata<a class="anchor" id="bullet3x4"></a>

Now we give each new feature its name and connect it to its data dictionary and its metadata dictionary.

In [20]:
metadata = {
    'mm_raw_bc'      : mm_raw_bc_Metadata,
    'mm_raw_uc'      : mm_raw_uc_Metadata,
    'mm_num_blocks'  : mm_num_blocks_Metadata,
    'mm_num_lemmas'  : mm_num_lemmas_Metadata,
    'mm_num_morphs'  : mm_num_morphs_Metadata,
    'mm_max_sim'     : mm_max_sim_Metadata,
    'mm_gram_dif'    : mm_gram_dif_Metadata,
    
    #  Now just pull in everything from ms_metadata:
    **ms_metadata,
    
    #  Now just pull in everything from md_metadata:
    **md_metadata
}

In [21]:
nodedata = {

    # ---------------------------- Morpheus Meta Features --------------------------------------
    # Map the Morpheus Metadata (mm_xxx) dictionaries
    'mm_raw_bc'      : mm_raw_bc_Dict,
    'mm_raw_uc'      : mm_raw_uc_Dict,
    'mm_num_blocks'  : mm_num_blocks_Dict,
    'mm_num_lemmas'  : mm_num_lemmas_Dict,
    'mm_num_morphs'  : mm_num_morphs_Dict,
    'mm_max_sim'     : mm_max_sim_Dict,
    'mm_gram_dif'    : mm_gram_dif_Dict,

    # --------------------------- Morpheus Summary Features -------------------------------------
    # unpack all Morpheus Summary (ms_xxx) comprehensions, one per field

    # Lemma related:
    **{ f'ms{i}_lem_full_bc'   : ms_lem_full_bc_Dict[i]   for i in range(1, number_of_ms_sets) },
    **{ f'ms{i}_lem_full_uc'   : ms_lem_full_uc_Dict[i]   for i in range(1, number_of_ms_sets) },
    **{ f'ms{i}_lem_base_bc'   : ms_lem_base_bc_Dict[i]   for i in range(1, number_of_ms_sets) },
    **{ f'ms{i}_lem_base_uc'   : ms_lem_base_uc_Dict[i]   for i in range(1, number_of_ms_sets) },
    **{ f'ms{i}_lem_homonym'   : ms_lem_homonym_Dict[i]   for i in range(1, number_of_ms_sets) },
    **{ f'ms{i}_lem_pl_suff'   : ms_lem_pl_suff_Dict[i]   for i in range(1, number_of_ms_sets) },

    # prvb related:
    **{ f'ms{i}_prvb_bc'       : ms_prvb_bc_Dict[i]       for i in range(1, number_of_ms_sets) },
    **{ f'ms{i}_prvb_uc'       : ms_prvb_uc_Dict[i]       for i in range(1, number_of_ms_sets) },

    # Morph related:
    **{ f'ms{i}_morph'         : ms_morph_Dict[i]         for i in range(1, number_of_ms_sets) }, 
    **{ f'ms{i}_morph_sim'     : ms_morph_sim_Dict[i]     for i in range(1, number_of_ms_sets) },
    **{ f'ms{i}_num_morphs'    : ms_num_morphs_Dict[i]    for i in range(1, number_of_ms_sets) }, 
    **{ f'ms{i}_gram_dif'      : ms_gram_dif_Dict[i]      for i in range(1, number_of_ms_sets) }, 
    
    # Block accounting related:
    **{ f'ms{i}_num_blocks'    : ms_num_blocks_Dict[i]    for i in range(1, number_of_ms_sets) }, 
    **{ f'ms{i}_block_nums'    : ms_block_nums_Dict[i]    for i in range(1, number_of_ms_sets) },    


    # --------------------------- Morpheus Details Features -------------------------------------
    # unpack all Morpheus Detailed (md_xxx) comprehensions, one per field
    # Main parts of each Morpheus analysis block

    # row ':workw'
    **{ f'md{i}_workw_bc'      : md_workw_bc_Dict[i]      for i in range(1, number_of_md_sets) },
    **{ f'md{i}_workw_uc'      : md_workw_uc_Dict[i]      for i in range(1, number_of_md_sets) },

    # row ':lem'
    **{ f'md{i}_lem_full_bc'   : md_lem_full_bc_Dict[i]   for i in range(1, number_of_md_sets) },
    **{ f'md{i}_lem_full_uc'   : md_lem_full_uc_Dict[i]   for i in range(1, number_of_md_sets) },
    **{ f'md{i}_lem_base_bc'   : md_lem_base_bc_Dict[i]   for i in range(1, number_of_md_sets) },
    **{ f'md{i}_lem_base_uc'   : md_lem_base_uc_Dict[i]   for i in range(1, number_of_md_sets) },
    **{ f'md{i}_lem_homonym'   : md_lem_homonym_Dict[i]   for i in range(1, number_of_md_sets) },
    **{ f'md{i}_lem_pl_suff'   : md_lem_pl_suff_Dict[i]   for i in range(1, number_of_md_sets) },

    # row ':prvb'
    **{ f'md{i}_prvb_bc'       : md_prvb_bc_Dict[i]       for i in range(1, number_of_md_sets) },
    **{ f'md{i}_prvb_uc'       : md_prvb_uc_Dict[i]       for i in range(1, number_of_md_sets) },

    # row ':aug1'    
    **{ f'md{i}_aug1_bc'       : md_aug1_bc_Dict[i]       for i in range(1, number_of_md_sets) },
    **{ f'md{i}_aug1_uc'       : md_aug1_uc_Dict[i]       for i in range(1, number_of_md_sets) },

    # row ':stem'
    **{ f'md{i}_stem_bc'       : md_stem_bc_Dict[i]       for i in range(1, number_of_md_sets) },
    **{ f'md{i}_stem_uc'       : md_stem_uc_Dict[i]       for i in range(1, number_of_md_sets) },
    **{ f'md{i}_stem_codes'    : md_stem_codes_Dict[i]    for i in range(1, number_of_md_sets) },
    **{ f'md{i}_stem_flags'    : md_stem_flags_Dict[i]    for i in range(1, number_of_md_sets) },

    # row ':end'
    **{ f'md{i}_end_bc'        : md_end_bc_Dict[i]        for i in range(1, number_of_md_sets) },
    **{ f'md{i}_end_uc'        : md_end_uc_Dict[i]        for i in range(1, number_of_md_sets) },
    **{ f'md{i}_end_codes'     : md_end_codes_Dict[i]     for i in range(1, number_of_md_sets) },
    **{ f'md{i}_end_flags'     : md_end_flags_Dict[i]     for i in range(1, number_of_md_sets) },

    # Morphological details from Morpheus analytic block
    **{ f'md{i}_gender'        : md_gender_Dict[i]        for i in range(1, number_of_md_sets) },
    **{ f'md{i}_number'        : md_number_Dict[i]        for i in range(1, number_of_md_sets) },
    **{ f'md{i}_case'          : md_case_Dict[i]          for i in range(1, number_of_md_sets) },
    **{ f'md{i}_person'        : md_person_Dict[i]        for i in range(1, number_of_md_sets) },
    **{ f'md{i}_voice'         : md_voice_Dict[i]         for i in range(1, number_of_md_sets) },
    **{ f'md{i}_mood'          : md_mood_Dict[i]          for i in range(1, number_of_md_sets) },
    **{ f'md{i}_tense'         : md_tense_Dict[i]         for i in range(1, number_of_md_sets) },   
    **{ f'md{i}_sec_tense'     : md_sec_tense_Dict[i]     for i in range(1, number_of_md_sets) },   
    **{ f'md{i}_degree'        : md_degree_Dict[i]        for i in range(1, number_of_md_sets) },
    **{ f'md{i}_gram_dif'      : md_gram_dif_Dict[i]      for i in range(1, number_of_md_sets) }, 
    **{ f'md{i}_typems'        : md_typems_Dict[i]        for i in range(1, number_of_md_sets) }, 

    # Combined from multiple lines
    **{ f'md{i}_dialects'      : md_dialects_Dict[i]      for i in range(1, number_of_md_sets) },   

    # Calculated values derived from Morpheus analytic block
    **{ f'md{i}_sp'            : md_sp_Dict[i]            for i in range(1, number_of_md_sets) },
    **{ f'md{i}_morph'         : md_morph_Dict[i]         for i in range(1, number_of_md_sets) },
}
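The `**{ ... for i in range(1, n) }` pattern used in the cell above expands one per-set dictionary into one feature per set number. The sketch below illustrates this with invented stand-in data; the names `number_of_sets`, `lem_Dict`, `morph_Dict` and `nodedata_sketch` and the node number 101 are assumptions made for illustration only.

```python
# Hypothetical stand-ins for the per-set data dictionaries of the notebook
number_of_sets = 4                       # range(1, 4) yields the sets 1, 2 and 3
lem_Dict   = {i: {} for i in range(1, number_of_sets)}
morph_Dict = {i: {} for i in range(1, number_of_sets)}

lem_Dict[1][101]   = 'λόγος'             # node 101 is an invented example node
morph_Dict[1][101] = 'N-NSM'

# One entry per field; each entry is expanded into one feature per set number
fields = {'lem_full_uc': lem_Dict, 'morph': morph_Dict}

nodedata_sketch = {
    f'ms{i}_{field}': per_set[i]
    for field, per_set in fields.items()
    for i in range(1, number_of_sets)
}

print(sorted(nodedata_sketch))
```

Because the end of `range()` is exclusive, a value of 9 for `number_of_ms_sets` would produce the features `ms1_…` up to `ms8_…`, and a value of 25 for `number_of_md_sets` would produce `md1_…` up to `md24_…`.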


## 3.5 - Save the feature to files<a class="anchor" id="bullet3x5"></a>

Now we save the new features, each to its own `.tf` file.

If you don’t pass an explicit target path, `TF.save()` writes the file to the directory that already contains the loaded corpus—in this case the local on‑disk copy of the N1904 Text‑Fabric dataset.
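Before saving, it can be useful to confirm that every feature in the data dictionary has a metadata entry and vice versa. The sketch below is one possible way to do that; the helper `compare_feature_keys` and the toy dictionaries are hypothetical and not part of Text-Fabric or of this notebook.

```python
from typing import Dict, Set, Tuple


def compare_feature_keys(nodedata: Dict[str, dict],
                         metadata: Dict[str, dict]) -> Tuple[Set[str], Set[str]]:
    """Return (features lacking metadata, metadata entries lacking data)."""
    data_keys = set(nodedata)
    meta_keys = set(metadata)
    return data_keys - meta_keys, meta_keys - data_keys


# Toy dictionaries; the feature names mirror the notebook, the values are invented
toy_nodedata = {'mm_max_sim': {1: 100}, 'ms1_morph': {1: 'N-NSM'}}
toy_metadata = {'mm_max_sim': {'valueType': 'int'}}

no_meta, no_data = compare_feature_keys(toy_nodedata, toy_metadata)
print(f'without metadata: {sorted(no_meta)}; without data: {sorted(no_data)}')
```

Applied to the real `nodedata` and `metadata` dictionaries, two empty sets would indicate that the feature names on both sides line up.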

In [22]:
TF.save(nodeFeatures=nodedata, metaData=metadata, silent="terse")

True

## 3.6 - Reload Text-Fabric with the new feature <a class="anchor" id="bullet3x6"></a>

Next we’ll confirm that Text‑Fabric can pick up the new feature.

Because the new `.tf` files live in *the same directory* as the rest of the N1904 dataset that we initially downloaded, we can use the very same `use()` call as before in <a href="#bullet2">step 2</a>. The only change is that we bind the result to a different instance (`N1904_ADD` instead of `N1904`) so that both the enriched and the original dataset can be inspected side by side.

In [23]:
# load the N1904-TF app and data in another instance
N1904_ADD = use('CenterBLC/N1904', mod="tonyjurg/N1904addons/tf/", silent="terse", hoist=globals())

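| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |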


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

In [25]:
F.lemma.freqList()

(('ὁ', 39566),
 ('καί', 17956),
 ('αὐτός', 11122),
 ('σύ', 5784),
 ('δέ', 5574),
 ('ἐν', 5486),
 ('ἐγώ', 5134),
 ('εἰμί', 4914),
 ('λέγω', 4510),
 ('εἰς', 3532),
 ('οὐ', 3244),
 ('ὅς', 2814),
 ('οὗτος', 2776),
 ('θεός', 2622),
 ('ὅτι', 2586),
 ('πᾶς', 2484),
 ('μή', 2120),
 ('γάρ', 2076),
 ('ἐκ', 1826),
 ('Ἰησοῦς', 1826),
 ('ἐπί', 1772),
 ('κύριος', 1436),
 ('ἔχω', 1418),
 ('πρός', 1400),
 ('ὁράω', 1364),
 ('ἵνα', 1338),
 ('γίνομαι', 1336),
 ('διά', 1334),
 ('ἀπό', 1294),
 ('ἀλλά', 1276),
 ('ἔρχομαι', 1270),
 ('ποιέω', 1132),
 ('τίς', 1124),
 ('ἄνθρωπος', 1096),
 ('Χριστός', 1058),
 ('τὶς', 1058),
 ('ὡς', 1004),
 ('εἰ', 1002),
 ('οὖν', 992),
 ('κατά', 944),
 ('μετά', 940),
 ('ἀκούω', 856),
 ('δίδωμι', 828),
 ('πολύς', 826),
 ('πατήρ', 824),
 ('ἡμέρα', 778),
 ('πνεῦμα', 758),
 ('υἱός', 752),
 ('ἀδελφός', 686),
 ('ἤ', 686),
 ('ἐάν', 674),
 ('εἷς', 672),
 ('περί', 666),
 ('λόγος', 662),
 ('οἶδα', 640),
 ('ἑαυτοῦ', 640),
 ('λαλέω', 596),
 ('οὐρανός', 546),
 ('μαθητής', 522),
 ('λαμβάνω', 5

In [26]:
F.mm_num_lemmas.freqList()

((1, 78324),
 (2, 29104),
 (3, 20501),
 (4, 5148),
 (5, 2124),
 (6, 154),
 (8, 142),
 (7, 99),
 (9, 42),
 (13, 10),
 (12, 6),
 (11, 1))
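The `freqList()` output is a tuple of `(value, count)` pairs, so simple statistics can be derived from it directly. The following sketch copies the pairs printed above into a literal instead of calling Text-Fabric; the variable names are chosen for illustration.

```python
# (value, count) pairs as printed above for mm_num_lemmas
freq_list = ((1, 78324), (2, 29104), (3, 20501), (4, 5148), (5, 2124), (6, 154),
             (8, 142), (7, 99), (9, 42), (13, 10), (12, 6), (11, 1))

total  = sum(count for _, count in freq_list)   # number of words that carry a value
single = dict(freq_list).get(1, 0)              # words with exactly one lemma
share  = 100 * single / total

print(f'{total} words carry a value; {share:.1f}% of them have exactly one lemma')
```

The counts add up to 135655, which equals the 137779 word nodes minus the 2124 words for which the analysis run reported no Morpheus blocks.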

In [27]:
F.mm_num_morphs.freqList()

((1, 52729),
 (2, 24278),
 (3, 21333),
 (4, 17746),
 (5, 6701),
 (7, 3448),
 (6, 2393),
 (8, 2161),
 (9, 1051),
 (10, 1021),
 (11, 876),
 (14, 687),
 (12, 477),
 (13, 266),
 (15, 146),
 (16, 143),
 (17, 113),
 (19, 51),
 (0, 15),
 (20, 9),
 (18, 5),
 (21, 2),
 (23, 2),
 (24, 2))

## 3.7 - Check if the new feature is loaded <a class="anchor" id="bullet3x7"></a>

One way to do this is the `A.isLoaded()` method, which can be applied to both the initial and the expanded dataset. The cells below inspect the new features directly through queries and feature lookups:

In [28]:
# Define the query template
VerseQuery = '''
verse book=Acts chapter=7 verse=14
'''
VerseResult = A.search(VerseQuery)

print(VerseResult)

morphFeatureList = (
      ['lemma', 'morph', 'mm_gram_dif']
    + ['betacode', 'gloss']
#     ['lemma','morph','mm_num_blocks','mm_num_lemmas','mm_gram_dif']
    + [f'ms{i}_lem_full_uc' for i in range(1, number_of_ms_sets)]
#   + [f'ms{i}_lem_base_bc' for i in range(1, number_of_ms_sets)]
    + [f'ms{i}_morph'       for i in range(1, number_of_ms_sets)]
    + [f'ms{i}_morph_sim'   for i in range(1, number_of_ms_sets)]
    + [f'ms{i}_gram_dif'    for i in range(1, number_of_ms_sets)]
    + [f'md{i}_sp'          for i in range(1, number_of_md_sets)]
#   + [f'md{i}_case'        for i in range(1, number_of_md_sets)]
    + [f'md{i}_morph'       for i in range(1, number_of_md_sets)]
#   + [f'md{i}_pos'         for i in range(1, number_of_md_sets)]
#   + [f'md{i}_gender'      for i in range(1, number_of_md_sets)]
#   + [f'ms{i}_num_blocks'  for i in range(1, number_of_ms_sets)]
    + [f'ms{i}_lem_full_bc' for i in range(1, number_of_ms_sets)]
#   + [f'md{i}_dialects'    for i in range(1, number_of_md_sets)]
#   + [f'md{i}_mood'        for i in range(1, number_of_md_sets)]
#   + [f'md{i}_degree'      for i in range(1, number_of_md_sets)]
)

#A.show(VerseResult,start=1,end=70,hiddenTypes={'wg','subphrase'}, extraFeatures=morphFeatureList, queryFeatures=False)

  0.00s 1 result
[(386690,)]


In [29]:
BLOCK_RANGE  = range(1, 25)
lem_feat     = {b: Fs(f"md{b}_lem_base_uc")  for b in BLOCK_RANGE}
morph_feat   = {b: Fs(f"md{b}_morph")  for b in BLOCK_RANGE}
for wordNode in F.otype.s('word'):
    for block in BLOCK_RANGE:
        if lem_feat[block].v(wordNode)=="ἄρχω" and "V-PAP-G" in morph_feat[block].v(wordNode):
            print (T.sectionFromNode(wordNode),f'{block=}', morph_feat[block].v(wordNode))

('Matthew', 9, 23) block=2 V-PAP-GSM/V-PAP-GSN
('Luke', 14, 1) block=2 V-PAP-GPM/V-PAP-GPN
('John', 7, 48) block=2 V-PAP-GPM/V-PAP-GPN
('John', 12, 42) block=2 V-PAP-GPM/V-PAP-GPN
('I_Corinthians', 2, 6) block=2 V-PAP-GPM/V-PAP-GPN
('I_Corinthians', 2, 8) block=2 V-PAP-GPM/V-PAP-GPN


In [30]:
BLOCK_RANGE  = range(1, 25)
lem_feat     = {b: Fs(f"md{b}_lem_base_uc")  for b in BLOCK_RANGE}
morph_feat   = {b: Fs(f"md{b}_morph")  for b in BLOCK_RANGE}
for wordNode in F.otype.s('word'):
    for block in BLOCK_RANGE:
        if (val := lem_feat[block].v(wordNode)) and "μετά,ἀνά" in val:
            print (T.sectionFromNode(wordNode),f'{block=} {val=}', morph_feat[block].v(wordNode))

('Matthew', 3, 8) block=3 val='μετά,ἀνά-οἰάω' V-PAI-2S
('Matthew', 3, 8) block=4 val='μετά,ἀνά-οἰάω' V-IAI-2S
('Matthew', 3, 11) block=3 val='μετά,ἀνά-οἰάω' V-PAN
('Matthew', 3, 11) block=4 val='μετά,ἀνά-οἰάω' V-PAN
('Matthew', 3, 11) block=5 val='μετά,ἀνά-οἰάω' V-PAP-NSN/V-PAP-VSN/V-PAP-ASN-A
('Matthew', 3, 11) block=6 val='μετά,ἀνά-οἰάω' V-PAP-VSM-A
('Matthew', 3, 11) block=7 val='μετά,ἀνά-οἰάω' V-PAP-NSM-A
('Matthew', 3, 11) block=8 val='μετά,ἀνά-οἰάω' V-IAI-1S-A
('Matthew', 3, 11) block=9 val='μετά,ἀνά-οἰάω' V-IAI-3P-A
('Mark', 1, 4) block=3 val='μετά,ἀνά-οἰάω' V-PAI-2S
('Mark', 1, 4) block=4 val='μετά,ἀνά-οἰάω' V-IAI-2S
('Luke', 3, 3) block=3 val='μετά,ἀνά-οἰάω' V-PAI-2S
('Luke', 3, 3) block=4 val='μετά,ἀνά-οἰάω' V-IAI-2S
('Luke', 3, 8) block=3 val='μετά,ἀνά-οἰάω' V-PAI-2S
('Luke', 3, 8) block=4 val='μετά,ἀνά-οἰάω' V-IAI-2S
('Luke', 5, 32) block=3 val='μετά,ἀνά-οἰάω' V-PAN
('Luke', 5, 32) block=4 val='μετά,ἀνά-οἰάω' V-PAN
('Luke', 5, 32) block=5 val='μετά,ἀνά-οἰάω' V-PAP-NSN/V-PAP

In [31]:
# Compare the N1904 part of speech (sp) with the part of speech derived from Morpheus.
# The Morpheus part-of-speech features are stored as md{i}_sp; there is no md{i}_pos feature.
BLOCK_RANGE  = range(1, 25)
pos_feat     = {b: Fs(f"md{b}_sp")  for b in BLOCK_RANGE}
for wordNode in F.otype.s('word'):
    for block in BLOCK_RANGE:
        m_sp = pos_feat[block].v(wordNode)
        # skip blocks without a value for this word
        if m_sp is not None and m_sp != F.sp.v(wordNode):
            print(T.sectionFromNode(wordNode), F.sp.v(wordNode), m_sp)
    break   # only inspect the first word node; remove this line to scan the whole corpus

In [32]:
# get frequency list and count how many lemmas have frequency 1
freqs = F.lemma.freqList()
ones = [lemma for lemma, freq in freqs if freq == 1]   # each item is a (lemma, frequency) pair
print(f"{len(ones)} lemmas with frequency 1")


In [38]:
from collections import Counter
paulStart = L.d(T.nodeFromSection(("Romans", 1, 1)), 'word')[0]
paulEnd   = L.d(T.nodeFromSection(("Philemon", 1, 25)), 'word')[-1]

# determine all hapaxes in the NT
lemmaFreq = dict(F.lemma.freqList({'word'}))
hapaxLemmas = {lemma for lemma, freq in lemmaFreq.items() if freq == 1}   # a set gives fast lookups
print(f'Total number of hapax lemmas in all of GNT: {len(hapaxLemmas)}')

# Cache all md{num}_lem_base_uc features
BLOCK_RANGE = range(1, 25)
lem_feat = {b: Fs(f"md{b}_lem_base_uc") for b in BLOCK_RANGE}

# Now compare each possible lemma in the Pauline corpus with the set of hapax legomena
nonHapax = Counter()
for wordNode in range(paulStart, paulEnd + 1):   # + 1 so that the last word is included
    n1904_lemma = F.lemma.v(wordNode)
    for block in BLOCK_RANGE:
        m_lemma = lem_feat[block].v(wordNode)
        if m_lemma and m_lemma != n1904_lemma and m_lemma in hapaxLemmas:
            nonHapax[m_lemma] += 1

print(f'{len(nonHapax)} hapaxes in the Pauline corpus do have potential matches:\n{nonHapax}')
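The hapax lookup used in this comparison can be tried out on a small scale without loading the corpus. The sketch below uses an invented list of lemmas and invented candidate lemmas; in the notebook the former would come from `F.lemma` and the latter from the `md{i}_lem_base_uc` features.

```python
from collections import Counter

# Invented mini-corpus of lemmas (one entry per word)
lemmas = ['λόγος', 'θεός', 'λόγος', 'δόλιος', 'εἰμί', 'θεός', 'ἀρχή']

lemma_freq   = Counter(lemmas)
hapax_lemmas = {lemma for lemma, freq in lemma_freq.items() if freq == 1}

# Invented candidate lemmas proposed by an analyser for a single word
candidates = ['δολιόω', 'δόλιος']
hits = [c for c in candidates if c in hapax_lemmas]

print(len(hapax_lemmas), hits)
```

Storing the hapax lemmas in a set keeps the membership test cheap, which matters when it is repeated for every block of every word.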

In [77]:
# Cache all md{num}_lem_base_uc features
BLOCK_RANGE = range(1, 25)
lem_feat = {b: Fs(f"md{b}_lem_base_uc") for b in BLOCK_RANGE}

for wordNode in F.otype.s('word'):
    for block in BLOCK_RANGE:
        if lem_feat[block].v(wordNode)=='δολιόω':
            print(f'{T.sectionFromNode(wordNode)} {wordNode=} {block=} {F.text.v(wordNode)}')

('II_Corinthians', 11, 13) wordNode=100520 block=4 δόλιοι
('II_Corinthians', 11, 13) wordNode=100520 block=5 δόλιοι
('II_Corinthians', 11, 13) wordNode=100520 block=6 δόλιοι
('II_Corinthians', 11, 13) wordNode=100520 block=7 δόλιοι
('II_Corinthians', 11, 13) wordNode=100520 block=8 δόλιοι
('II_Corinthians', 11, 13) wordNode=100520 block=9 δόλιοι


In [84]:
nodeList=[(100520,)]
number_of_ms_sets=9
number_of_md_sets=25

morphFeatureList = (
    ['lemma','morph','gloss']
#    ['lemma','morph','mm_num_blocks','mm_num_lemmas']
    + [f'ms{i}_lem_full_uc' for i in range(1, number_of_ms_sets)]
#    + [f'ms{i}_lem_full_bc' for i in range(1, number_of_ms_sets)]
    + [f'ms{i}_morph'      for i in range(1, number_of_ms_sets)]
    + [f'ms{i}_morph_sim'      for i in range(1, number_of_ms_sets)]
#    + [f'ms{i}_num_blocks'      for i in range(1, number_of_ms_sets)]
#    + [f'md{i}_lem_base_bc'      for i in range(1, number_of_md_sets)]
)

A.show(nodeList,hiddenTypes={'wg','phrase','subphrase'}, extraFeatures=morphFeatureList, queryFeatures=False)

In [34]:
# Define the query template
VerseQuery = '''
book book=John
  chapter chapter=1
      verse verse=1|2|3|4|5
'''
VerseResult = N1904_ADD.search(VerseQuery)

number_of_ms_sets=9
number_of_md_sets=25

morphFeatureList = (
    ['lemma','morph','gloss']
#    ['lemma','morph','mm_num_blocks','mm_num_lemmas']
    + [f'ms{i}_lem_full_uc' for i in range(1, number_of_ms_sets)]
#    + [f'ms{i}_lem_full_bc' for i in range(1, number_of_ms_sets)]
    + [f'ms{i}_morph'      for i in range(1, number_of_ms_sets)]
    + [f'ms{i}_morph_sim'      for i in range(1, number_of_ms_sets)]
#    + [f'ms{i}_num_blocks'      for i in range(1, number_of_ms_sets)]
#    + [f'md{i}_lem_base_bc'      for i in range(1, number_of_md_sets)]
)

N1904_ADD.show(VerseResult,hiddenTypes={'wg','phrase','subphrase'}, extraFeatures=morphFeatureList, queryFeatures=False)

  0.02s 5 results


In [35]:
T.sectionFromNode(7363)

('Matthew', 13, 22)

In [39]:
morphkit.analyse_word_with_morpheus('gene/sews', api_endpoint)

{'raw_bc': 'gene/sews',
 'raw_uc': 'γενέσεως',
 'blocks': 1,
 'analyses': [{'raw_bc': 'gene/sews',
   'raw_uc': 'γενέσεως',
   'workw_bc': 'gene/sew^s',
   'workw_uc': 'γενέσεω̂ς',
   'lem_full_bc': 'ge/nesis',
   'lem_full_uc': 'γένεσις',
   'lem_base_bc': 'ge/nesis',
   'lem_base_uc': 'γένεσις',
   'stem_bc': 'genes',
   'stem_uc': 'γενες',
   'stem_gender': 'fem',
   'stem_codes': ['is_ews'],
   'end_bc': 'ew^s',
   'end_uc': 'εω̂ς',
   'gender': 'fem',
   'case': 'gen',
   'number': 'sg',
   'dialects': ['attic'],
   'end_codes': ['is_ews'],
   'pos': 'noun',
   'morph': 'N-GSF-ATT'}]}


## 3.8 - Move the newly created feature to its final location <a class="anchor" id="bullet3x8"></a>

The last step is to move the newly created feature files from `~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0` (see the output of <a href="#bullet3x5">step 3.5</a>) to their final location: https://github.com/tonyjurg/N1904addons/tree/main/tf/1.0.0.
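This move can also be scripted. The following is a minimal sketch under stated assumptions: the helper `move_morpheus_features` is hypothetical, and the regular expression assumes that the new feature files are named after the features created in this notebook (`mm_…`, `ms<N>_…` and `md<N>_…`, each with the extension `.tf`).

```python
import re
import shutil
from pathlib import Path
from typing import List

# Matches the feature-file prefixes used in this notebook: mm_, ms<N>_ and md<N>_
FEATURE_PATTERN = re.compile(r'^(mm_|ms\d+_|md\d+_).+\.tf$')


def move_morpheus_features(source_dir: Path, target_dir: Path) -> List[str]:
    """Move the matching .tf files from source_dir to target_dir; return their names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for tf_file in sorted(source_dir.glob('*.tf')):
        if FEATURE_PATTERN.match(tf_file.name):
            shutil.move(str(tf_file), str(target_dir / tf_file.name))
            moved.append(tf_file.name)
    return moved
```

The source directory would be the location named above, and the target directory a local clone of the N1904addons repository (its path depends on the local setup). Files belonging to the original N1904 dataset, such as `morph.tf`, do not match the pattern and stay where they are.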

# 4 - Attribution and footnotes <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

Greek base text: Nestle1904 Greek New Testament, edited by Eberhard Nestle, published in 1904 by the British and Foreign Bible Society. Transcription by [Diego Santos](https://sites.google.com/site/nestle1904/home). Public domain.

Betacode syntax follows the TLG/Perseus convention: [Thesaurus Linguae Graecae / Perseus Project spec.](https://stephanus.tlg.uci.edu/encoding/BCM.pdf)

The conversion code between Unicode and Betacode is available at [GitHub repository perseids-tools/beta-code-py](https://github.com/perseids-tools/beta-code-py).

The [N1904-TF dataset](https://centerblc.github.io/N1904/) is available under the [MIT licence](https://github.com/CenterBLC/N1904/blob/main/LICENSE.md), Copyright (c) 2025 Center of Biblical Languages and Computing (CBLC). Formal reference: Tony Jurg, Saulo de Oliveira Cantanhêde, & Oliver Glanz. (2024). *CenterBLC/N1904: Nestle 1904 Text-Fabric data*. Zenodo. DOI: [10.5281/zenodo.13117911](https://doi.org/10.5281/zenodo.13117910).

The Text-Fabric features created in this notebook were added to the dataset published at [tonyjurg.github.io/N1904addons](https://tonyjurg.github.io/N1904addons/) and made available under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://github.com/tonyjurg/N1904addons/blob/main/LICENSE.md) license.

The [Anaconda Assistant](https://www.anaconda.com/capability/anaconda-assistant) (using [OpenAI](https://openai.com/) as backend) was used to debug and/or optimize the code in this Jupyter Notebook.

# 5 - Required libraries<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

Because the scripts in this notebook use Text-Fabric, they [currently (Apr 2025) require Python >=3.9.0](https://pypi.org/project/text-fabric), together with the following libraries in the environment:

    beta_code
    unicodedata
    
`unicodedata` is part of the Python standard library and needs no separate installation. You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.
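Whether the listed libraries are available can be checked before running the notebook. The sketch below uses the standard-library function `importlib.util.find_spec`; the helper name `find_missing` is hypothetical.

```python
from importlib.util import find_spec


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


# 'beta_code' and 'unicodedata' are the libraries listed above
print(find_missing(['beta_code', 'unicodedata']))
```

An empty list means that both libraries can be imported; any name that is printed still needs to be installed.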

# 6 - Notebook version<a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.4</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>22 June 2025</td>
    </tr>
  </table>
</div>