# Test morph tag decoding by Morphkit (Morpheus) 

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Setting up testing environment</a>
    * <a href="#bullet2x1">2.1 - Load N1904-TF with N1904Addons</a>
    * <a href="#bullet2x2">2.2 - Load morphkit</a>
* <a href="#bullet3">3 - Run the tests</a>    
    * <a href="#bullet3x1">3.1 - Test decoding of some malformed morph tags</a>
    * <a href="#bullet3x2">3.2 - Run automated test with all morphs in the GNT</a>
* <a href="#bullet4">4 - Conclusions</a>
* <a href="#bullet5">5 - Attribution and footnotes</a>
* <a href="#bullet6">6 - Required libraries</a>
* <a href="#bullet7">7 - Notebook version</a>


# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This Jupyter Notebook contains a test setup to verify the ability to decode morphological tags using the Sandborg-Petersen (SP) tagging scheme through the `decode_tag()` function, which is part of the `Morphkit` package.

After loading the required resources — including Text-Fabric, the `N1904-TF` dataset, and the `morphkit` module — the test run begins by collecting all SP-encoded morphological tags from the N1904-TF dataset of the Greek New Testament.

These collected tags are then passed to the decoder function with the `debug` flag enabled. This flag enriches the output with additional diagnostic information, which can be useful for analysis or troubleshooting.

Although the decoder generates verbose debug output, our primary interest lies in the final decoded results rather than the internal execution flow. To address this, the debug output is wrapped in a context that suppresses console noise while retaining the augmented result fields for downstream inspection.

# 2 - Setting up testing environment <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

Here we load the required resources.

## 2.1 - Load N1904-TF with N1904Addons<a class="anchor" id="bullet2x1"></a>

We load the plain N1904-TF dataset together with the additional N1904Addons dataset, since we want to test the morph tags derived from Morpheus.

In [1]:
# Load the autoreload extension to automatically reload modules before executing code
%load_ext autoreload
%autoreload 2

In [2]:
# Import required modules, including Text-Fabric (tf)
from tf.app import use
import os
import sys

In [11]:
# Load the N1904-TF app and data 
A = use ("CenterBLC/N1904", version="1.0.0", mod="tonyjurg/N1904addons/tf/", hoist=globals())

**Locating corpus resources ...**

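| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |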


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

## 2.2 - Load morphkit <a class="anchor" id="bullet2x2"></a>

In this step, we load the `morphkit` package, which contains the tag decoding functionality used in this notebook. Since `morphkit` is not yet a formally published package (e.g. on PyPI), it cannot be installed using `pip`.

Instead, we include it locally and ensure it is accessible within the notebook by modifying the Python module search path (`sys.path`). This allows us to import and use the package as if it were installed in the environment.
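
A slightly more defensive variant of this step could resolve the directory to an absolute path and check that it exists before adding it to `sys.path`. The following is a minimal sketch only; the helper name `add_module_dir` is hypothetical and not part of `morphkit`.

```python
import sys
from pathlib import Path

def add_module_dir(relative_dir: str) -> Path:
    """Resolve relative_dir and put it at the front of sys.path (hypothetical helper)."""
    module_dir = Path(relative_dir).resolve()
    if not module_dir.is_dir():
        raise FileNotFoundError(f"Module directory not found: {module_dir}")
    path_str = str(module_dir)
    if path_str not in sys.path:      # avoid duplicate entries when the cell is re-run
        sys.path.insert(0, path_str)
    return module_dir
```

In this notebook the call would be `add_module_dir("../../morphkit")`, followed by `import morphkit`. A missing directory then fails immediately with a clear message instead of a later `ModuleNotFoundError`.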

In [4]:
# Essential: we need to add the relative path to the DIR morphkit (which contains the actual module files) 
import sys
sys.path.insert(0, "../../morphkit")    # relative to notebook dir

import morphkit

morphkit loaded


# 3 - Run the tests <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

## 3.1 - Test decoding of some malformed morph tags <a class="anchor" id="bullet3x1"></a>

A context manager can suppress the many lines printed by `decode_tag()` while still allowing ERROR output to be shown.
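
One way to do this is sketched below: stdout is redirected to the null device while the decoder runs, and any error found in the result is then reported on stderr, which is not redirected. The function `noisy_decoder` is a hypothetical stand-in used only to make the example self-contained; it does not reproduce how `morphkit.decode_tag` works internally or where it sends its messages.

```python
import os
import sys
from contextlib import redirect_stdout

def noisy_decoder(tag, debug=False):
    """Hypothetical stand-in for a decoder that prints debug lines to stdout."""
    if debug:
        print(f"[debug] decoding {tag!r}")
    if not tag.strip():
        return {"Part of Speech": "Unknown or Unsupported",
                "Error": "Input cannot be empty"}
    return {"Part of Speech": "Noun"}

def decode_quietly(decoder, tag):
    """Run decoder with debug=True, discard its stdout, report errors on stderr."""
    with open(os.devnull, "w") as devnull, redirect_stdout(devnull):
        result = decoder(tag, debug=True)
    if "Error" in result:
        # stderr is not redirected, so this message remains visible
        print(f"ERROR for {tag!r}: {result['Error']}", file=sys.stderr)
    return result

quiet_ok = decode_quietly(noisy_decoder, "N-NSM")
quiet_bad = decode_quietly(noisy_decoder, "")
```

The returned dictionaries are unaffected by the redirection, so they can still be inspected after the call.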

In [5]:
morphkit.decode_tag('X-GDF-ATT',debug=True)

{'Part of Speech': 'Indefinite Pronoun',
 'Case': 'Genitive',
 'Number': 'Dual',
 'Gender': 'Feminine',
 'Suffix': 'Attic'}

In [6]:
morphkit.decode_tag('',debug=True)

[decode_tag] ERROR: Input is empty or only whitespace: ''


{'Part of Speech': 'Unknown or Unsupported', 'Error': 'Input cannot be empty'}

In [7]:
morphkit.decode_tag('N-ACF',debug=True)

[decode_tag] Return ({'Part of Speech': 'Noun', 'Case': 'Accusative', 'Number': 'Unknown', 'Gender': 'Feminine'})


{'Part of Speech': 'Noun',
 'Case': 'Accusative',
 'Number': 'Unknown',
 'Gender': 'Feminine'}

In [8]:
morphkit.decode_tag('V-RAN-ATT',debug=True)

{'Part of Speech': 'Verb',
 'Tense': 'Perfect',
 'Voice': 'Active',
 'Mood': 'Infinitive',
 'Verb extra': 'Attic'}

In [1]:
morphkit.decode_tag('V-AAI-1P',debug=True)

NameError: name 'morphkit' is not defined

## 3.2 - Run automated test with all morphs in the GNT <a class="anchor" id="bullet3x2"></a>

### Define the common test routine

The following defines a re-usable test function.

In [16]:
import os
from contextlib import redirect_stdout
from collections import Counter

def check_morph_tags(
    tag_iterable,
    decoder,
    debug: bool = True,
    suppress_stdout: bool = True,
    verbose: bool = True
):
    """
    Run decoder(tag, debug=debug) on each tag in tag_iterable,
    collecting counts of Error-fields, Warning-fields, and Unknown-values.

    Returns a dict with counts: {
        'tested': int,
        'errors': int,
        'warnings': int,
        'unknowns': int
    }.
    If verbose, prints per-tag diagnostics.
    """
    # counters
    cnt = Counter()
    cnt['tested'] = 0
    cnt['errors'] = 0
    cnt['warnings'] = 0
    cnt['unknowns'] = 0

    for tag in tag_iterable:
        cnt['tested'] += 1

        # optionally suppress the decoder's stdout
        if suppress_stdout:
            with open(os.devnull, 'w') as devnull, redirect_stdout(devnull):
                result = decoder(tag, debug=debug)
        else:
            result = decoder(tag, debug=debug)

        # find Error keys
        errors = {k: v for k, v in result.items()
                  if isinstance(v, str) and "Error" in k}
        if errors:
            cnt['errors'] += 1
            if verbose:
                print(f"[Error]    {tag!r}:")
                for field, msg in errors.items():
                    print(f"    {field!r}: {msg}")

        # find Warning keys
        warnings = {k: v for k, v in result.items()
                    if isinstance(v, str) and "Warning" in k}
        if warnings:
            cnt['warnings'] += 1
            if verbose:
                print(f"[Warning]  {tag!r}:")
                for field, msg in warnings.items():
                    print(f"    {field!r}: {msg}")

        # find Unknown values
        unknowns = {k: v for k, v in result.items()
                    if isinstance(v, str) and "Unknown" in v}
        if unknowns:
            cnt['unknowns'] += 1
            if verbose:
                print(f"[Unknown]  {tag!r}:")
                for field, msg in unknowns.items():
                    print(f"    {field!r}: {msg}")

    # summary
    print((
        f"\nSummary: tested={cnt['tested']}; "
        f"errors={cnt['errors']}; "
        f"warnings={cnt['warnings']}; "
        f"unknowns={cnt['unknowns']}"
    ))

    return dict(cnt)


### Get all morphs from the N1904-TF

In [19]:
from collections import Counter
import pandas as pd

# Accumulate into a single Counter
morphFrequency = Counter()

# Iterate over all word nodes
for wordNode in F.otype.s("word"):
    morph = F.morph.v(wordNode)    
    if not morph:
        continue
    morphFrequency[morph] += 1

# Build a DataFrame sorted by descending frequency
df1 = (
    pd.DataFrame(
        list(morphFrequency.items()),
        columns=["morph", "count"]
    )
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)

# Display DataFrame
print(df1)

              morph  count
0              CONJ  16316
1              PREP  10568
2               ADV   3808
3             N-NSM   3475
4             N-GSM   2935
...             ...    ...
1050      V-PEP-GPN      1
1051  V-2RAI-3S-ATT      1
1052     V-2RAP-NSF      1
1053       V-RDI-3S      1
1054        N-APN-S      1

[1055 rows x 2 columns]


Now let us check if all tags are decodable.

In [20]:
# now we can call the special test function

summary = check_morph_tags(
    tag_iterable    = df1['morph'],
    decoder         = morphkit.decode_tag,
    debug           = True,
    suppress_stdout = True,
    verbose         = True
)

# `summary` is a dict: {'tested':…, 'errors':…, 'warnings':…, 'unknowns':…}
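
The returned counts can also be turned into an explicit pass/fail check. The sketch below assumes a dict of the shape documented for `check_morph_tags`; the helper name `assert_clean_run` and the example values are hypothetical and are not results from this notebook.

```python
def assert_clean_run(summary: dict, allowed_unknowns: int = 0) -> None:
    """Raise AssertionError if a run reported errors, warnings or too many unknowns."""
    assert summary["tested"] > 0, "no tags were tested"
    assert summary["errors"] == 0, f"{summary['errors']} tag(s) produced an Error field"
    assert summary["warnings"] == 0, f"{summary['warnings']} tag(s) produced a Warning field"
    assert summary["unknowns"] <= allowed_unknowns, (
        f"{summary['unknowns']} tag(s) contain Unknown values"
    )

# Hypothetical example values, for illustration only
example_summary = {"tested": 3, "errors": 0, "warnings": 0, "unknowns": 0}
assert_clean_run(example_summary)
```

In the notebook itself, the call would be `assert_clean_run(summary)` directly after the test run.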




## Investigating SP tag V-RAN-ATT

The test run initially showed an error for SP tag `V-RAN-ATT`. According to the [tag description](https://github.com/biblicalhumanities/Nestle1904/blob/master/morph/parsing.txt), the pattern for a verb with mood='N' (infinitive) should be:
```text
Patterns:
  V- tense voice N 
  
Mood N=infinitive
```

Strictly speaking, the tag `V-RAN-ATT` does not comply with this pattern, because of its `-ATT` suffix. So let us look up where it occurs in the N1904-TF.
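
The mismatch can be illustrated with a regular expression. This is a rough sketch under stated assumptions: tense and voice are each taken to be a single uppercase letter, and the tense may carry a leading `2` (as in corpus tags such as `V-2RAI-3S-ATT`). It does not validate the letters against the actual code tables in the tag description, and `is_strict_infinitive` is a hypothetical name.

```python
import re

# Assumption: tense and voice are one uppercase letter each,
# and the tense may be prefixed with "2".
INFINITIVE_PATTERN = re.compile(r"V-2?[A-Z][A-Z]N")

def is_strict_infinitive(tag: str) -> bool:
    """Return True if tag has exactly the form V-<tense><voice>N, with no suffix."""
    return INFINITIVE_PATTERN.fullmatch(tag) is not None

for tag in ("V-RAN", "V-RAN-ATT"):
    print(tag, is_strict_infinitive(tag))
# V-RAN True
# V-RAN-ATT False
```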

In [35]:
vranQuery='''
word morph=V-RAN-ATT
'''
vranResult = A.search(vranQuery)

  0.09s 1 result


In [32]:
A.show(vranResult, hiddenTypes={'wg','group','clause','subphrase','phrase','sentence'}, extraFeatures={'mood','case','number','gender'})

## Get all morphs from Morpheus

In [15]:
from collections import Counter
import pandas as pd

# Accumulate into a single Counter
grand = Counter()

# Iterate over all word nodes
for wordNode in F.otype.s("word"):
    for blockNumber in range(1, 9):
        lemma = Fs(f"ms{blockNumber}_lemma").v(wordNode)
        if not lemma:
            continue
        morph_string = Fs(f"ms{blockNumber}_morph").v(wordNode)
        if not morph_string:
            continue

        for tag in morph_string.split("/"):
            tag = tag.strip()
            if tag:
                grand[tag] += 1

# Build a DataFrame sorted by descending frequency
df2 = (
    pd.DataFrame(grand.items(), columns=["morph", "count"])
      .sort_values("count", ascending=False)
      .reset_index(drop=True)
)

print(df2)

          morph  count
0          CONJ  15194
1           ADV  13716
2          PREP  12797
3         N-PRI  10687
4      V-PAI-2S   6494
...         ...    ...
1005  V-RMP-GSF      1
1006  V-RMP-ASF      1
1007  V-RMP-GPM      1
1008  V-RMP-GPN      1
1009  V-RMP-GPF      1

[1010 rows x 2 columns]
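
The inner loop above treats each `ms{n}_morph` value as a slash-separated list of tags. The following minimal sketch shows that splitting step in isolation. The input strings are made up for illustration and are not taken from the dataset, and the helper name `count_tags` is hypothetical.

```python
from collections import Counter

def count_tags(morph_strings):
    """Count individual tags in an iterable of slash-separated tag strings."""
    counts = Counter()
    for morph_string in morph_strings:
        if not morph_string:          # skip None and empty strings
            continue
        for tag in morph_string.split("/"):
            tag = tag.strip()
            if tag:                   # ignore empty pieces, e.g. from "A//B"
                counts[tag] += 1
    return counts

# Made-up example input, for illustration only
sample = ["N-NSM/N-ASM", "CONJ", None, " ADV / CONJ ", ""]
tag_counts = count_tags(sample)
```

Stripping whitespace before counting keeps variants such as `" CONJ "` and `"CONJ"` from being counted as different tags.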


In [23]:
# now we can call the special test function

summary = check_morph_tags(
    tag_iterable    = df2['morph'],
    decoder         = morphkit.decode_tag,
    debug           = True,
    suppress_stdout = True,
    verbose         = True
)

# `summary` is a dict: {'tested':…, 'errors':…, 'warnings':…, 'unknowns':…}

[Error]    'UNK':
    'Error': POS unknown
[Unknown]  'UNK':
    'Part of Speech': Unknown or Unsupported



# 4 - Conclusions <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

Only a very small number of empty or unknown values remain. Among the Morpheus-derived tags, only `UNK` is reported, as both an error and an unknown.

Furthermore, no 'errors', 'warnings' or 'unknowns' are reported any more for the morph tags found in the GNT.

# 5 - Attribution and footnotes <a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

# 6 - Required libraries <a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

The scripts in this notebook use Text-Fabric, which [currently (Apr 2025) requires Python >=3.9.0](https://pypi.org/project/text-fabric), together with the following libraries:

    collections
    pandas
    contextlib
    os
    sys
    
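Of these, only `pandas` is a third-party package; the others are part of the Python standard library. You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.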

# 7 - Notebook version<a class="anchor" id="bullet7"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>19 May 2025</td>
    </tr>
  </table>
</div>