# Check tags in N1904-TF and Morpheus

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Setting up testing environment</a>
    * <a href="#bullet2x1">2.1 - Load N1904-TF with N1904Addons</a>
    * <a href="#bullet2x2">2.2 - Load morphkit</a>
* <a href="#bullet3">3 - Run the tests</a>
    * <a href="#bullet3x1">3.1 - Get all morphs from N1904-TF</a>
    * <a href="#bullet3x2">3.2 - Get all morphs from Morpheus</a>
* <a href="#bullet4">4 - Attribution and footnotes</a>
* <a href="#bullet5">5 - Required libraries</a>
* <a href="#bullet6">6 - Notebook version</a>


# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

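This notebook checks the morphological tags in N1904-TF against the tags derived from Morpheus. It collects the frequency of every tag in the `morph` feature of N1904-TF, does the same for the Morpheus-derived tags in the features `ms1_morph` to `ms8_morph`, and prints both inventories.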

# 2 - Setting up testing environment <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

Here we load the required resources.

## 2.1 - Load N1904-TF with N1904Addons<a class="anchor" id="bullet2x1"></a>

We load the plain N1904-TF dataset together with the additional N1904Addons dataset, because the tests target the morph tags derived from Morpheus.

In [1]:
# Load the autoreload extension to automatically reload modules before executing code
# This is useful during development so you don't have to restart the kernel
%load_ext autoreload
%autoreload 2

In [2]:
# Import required modules, including Text-Fabric (tf)
from tf.app import use
import os
import sys

In [3]:
# Load the N1904-TF app and data
A = use("CenterBLC/N1904", version="1.0.0", mod="tonyjurg/N1904addons/tf/", hoist=globals())

**Locating corpus resources ...**

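| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |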


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

## 2.2 - Load morphkit <a class="anchor" id="bullet2x2"></a>

In [5]:
# Essential: add the morphkit directory (which contains the actual module files) to the module search path
import sys
sys.path.insert(0, "../../morphkit")    # relative to the notebook directory

import morphkit
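
Because the relative path depends on where the notebook is started, a failing `import morphkit` can be hard to diagnose. The sketch below shows one possible guard: it resolves the directory, checks that it exists, and only then adds it to `sys.path`. The helper name `add_module_dir` is hypothetical and not part of morphkit; the directory layout is assumed to match the cell above.

```python
import sys
from pathlib import Path
from typing import Optional


def add_module_dir(relative_dir: str, base: Optional[Path] = None) -> Path:
    """Resolve relative_dir against base (default: current working directory)
    and put it at the front of sys.path. Hypothetical helper for illustration."""
    base = base or Path.cwd()
    module_dir = (base / relative_dir).resolve()
    if not module_dir.is_dir():
        # Fail early with the full path, rather than with a bare ImportError later
        raise FileNotFoundError(f"Module directory not found: {module_dir}")
    path_str = str(module_dir)
    if path_str not in sys.path:
        sys.path.insert(0, path_str)
    return module_dir
```

In the notebook this could be called as `add_module_dir("../../morphkit")` before `import morphkit`.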

# 3 - Run the tests <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

## 3.1 - Get all morphs from N1904-TF <a class="anchor" id="bullet3x1"></a>

In [6]:
from collections import Counter
import pandas as pd

# Accumulate into a single Counter
morphFrequency = Counter()

# Iterate over all word nodes
for wordNode in F.otype.s("word"):
    morph = F.morph.v(wordNode)
    if not morph:
        continue
    morphFrequency[morph] += 1

# Build a DataFrame sorted by descending frequency
df1 = (
    pd.DataFrame(
        list(morphFrequency.items()),
        columns=["morph", "count"]
    )
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)

# Display DataFrame
print(df1)


              morph  count
0              CONJ  16316
1              PREP  10568
2               ADV   3808
3             N-NSM   3475
4             N-GSM   2935
...             ...    ...
1050      V-PEP-GPN      1
1051  V-2RAI-3S-ATT      1
1052     V-2RAP-NSF      1
1053       V-RDI-3S      1
1054        N-APN-S      1

[1055 rows x 2 columns]
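
With more than a thousand distinct tags, it can help to condense the table by grouping tags on their leading component. The following is a minimal sketch, assuming that the text before the first hyphen identifies the category of a tag (for example `N` in `N-NSM`). The toy counts are the five most frequent rows printed above, and `sample_frequency` and `prefix_counts` are illustrative names.

```python
from collections import Counter

# The five most frequent tags from the output above (toy subset, not the full data)
sample_frequency = Counter({
    "CONJ": 16316,
    "PREP": 10568,
    "ADV": 3808,
    "N-NSM": 3475,
    "N-GSM": 2935,
})

# Sum the counts per leading component (assumed here to be the tag category)
prefix_counts = Counter()
for morph, count in sample_frequency.items():
    prefix = morph.split("-", 1)[0]
    prefix_counts[prefix] += count

# Print the categories, most frequent first
for prefix, count in prefix_counts.most_common():
    print(f"{prefix:5} {count}")
```

In the notebook itself, `morphFrequency` could take the place of `sample_frequency`.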


## 3.2 - Get all morphs from Morpheus <a class="anchor" id="bullet3x2"></a>

In [8]:
from collections import Counter
import pandas as pd

# Accumulate into a single Counter
grand = Counter()

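# Iterate over all word nodes
for wordNode in F.otype.s("word"):
    # Visit the Morpheus analysis blocks ms1_* through ms8_*
    for blockNumber in range(1, 9):
        lemma = Fs(f"ms{blockNumber}_lemma").v(wordNode)
        if not lemma:
            continue  # no lemma stored in this block
        morph_string = Fs(f"ms{blockNumber}_morph").v(wordNode)
        if not morph_string:
            continue  # no morph tags stored in this block

        # A block may hold several tags separated by "/"
        for tag in morph_string.split("/"):
            tag = tag.strip()
            if tag:
                grand[tag] += 1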

# Build a DataFrame sorted by descending frequency
df2 = (
    pd.DataFrame(grand.items(), columns=["morph", "count"])
      .sort_values("count", ascending=False)
      .reset_index(drop=True)
)

print(df2)

          morph  count
0          CONJ  15194
1           ADV  13716
2          PREP  12797
3         N-PRI  10687
4      V-PAI-2S   6494
...         ...    ...
1005  V-RMP-GSF      1
1006  V-RMP-ASF      1
1007  V-RMP-GPM      1
1008  V-RMP-GPN      1
1009  V-RMP-GPF      1

[1010 rows x 2 columns]
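
With both inventories available, a natural follow-up is to list the tags that occur in only one of them. The sketch below is not part of the original notebook. It uses small toy frames in place of `df1` and `df2` (the tag `X-ONLY-TF` is a made-up placeholder) and relies on the pandas `merge` method with `how="outer"` and `indicator=True`, which adds a `_merge` column that records the origin of each row.

```python
import pandas as pd

# Toy stand-ins for df1 (N1904-TF) and df2 (Morpheus); the counts come from the
# outputs above, except for the hypothetical tag "X-ONLY-TF"
tf_tags = pd.DataFrame(
    {"morph": ["CONJ", "PREP", "X-ONLY-TF"], "count": [16316, 10568, 1]}
)
morpheus_tags = pd.DataFrame(
    {"morph": ["CONJ", "PREP", "N-PRI"], "count": [15194, 12797, 10687]}
)

# The outer merge keeps every tag; "_merge" is left_only, right_only or both
comparison = tf_tags.merge(
    morpheus_tags,
    on="morph",
    how="outer",
    suffixes=("_tf", "_morpheus"),
    indicator=True,
)

only_in_tf = comparison.loc[comparison["_merge"] == "left_only", "morph"].tolist()
only_in_morpheus = comparison.loc[comparison["_merge"] == "right_only", "morph"].tolist()

print("Only in N1904-TF:", only_in_tf)
print("Only in Morpheus:", only_in_morpheus)
```

Because the Morpheus loop counts every tag in each of the eight analysis blocks of a word, the two `count` columns are not directly comparable. Only the presence or absence of a tag is.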


# 4 - Attribution and footnotes <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

Greek base text: Nestle1904 Greek New Testament, edited by Eberhard Nestle, published in 1904 by the British and Foreign Bible Society. Transcription by [Diego Santos](https://sites.google.com/site/nestle1904/home). Public domain.

Betacode syntax follows the TLG/Perseus convention: [Thesaurus Linguae Graecae / Perseus Project spec.](https://stephanus.tlg.uci.edu/encoding/BCM.pdf)

The conversion code between Unicode and Betacode is available at [GitHub repository perseids-tools/beta-code-py](https://github.com/perseids-tools/beta-code-py).

The [N1904-TF dataset](https://centerblc.github.io/N1904/) is available under the [MIT licence](https://github.com/CenterBLC/N1904/blob/main/LICENSE.md). Formal reference: Tony Jurg, Saulo de Oliveira Cantanhêde, & Oliver Glanz. (2024). *CenterBLC/N1904: Nestle 1904 Text-Fabric data*. Zenodo. DOI: [10.5281/zenodo.13117911](https://doi.org/10.5281/zenodo.13117910).

The Text-Fabric features created in this notebook were added to the dataset published at [tonyjurg.github.io/N1904addons](https://tonyjurg.github.io/N1904addons/) and made available under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://github.com/tonyjurg/N1904addons/blob/main/LICENSE.md) license.

The [Anaconda Assistant](https://www.anaconda.com/capability/anaconda-assistant) (using [OpenAI](https://openai.com/) as backend) was used to debug and/or optimize the code in this Jupyter Notebook.

# 5 - Required libraries<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

Since the scripts in this notebook use Text-Fabric, they currently (Apr 2025) [require Python >=3.9.0](https://pypi.org/project/text-fabric), together with the following libraries installed in the environment:

    beta_code
    unicodedata
    
You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.
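
To find out which libraries are missing before running the notebook, the availability of each module can be checked without importing it. This is a minimal sketch using `importlib.util.find_spec` from the standard library. The list holds import names taken from this section and from the import statements in the notebook; the name used with `pip` may differ from the import name (for example, `tf` is provided by the `text-fabric` package linked above).

```python
import importlib.util

# Import names (not necessarily pip package names) used by this notebook
required_modules = ["tf", "pandas", "beta_code", "unicodedata"]

# find_spec returns None when a top-level module cannot be located
missing = [
    name for name in required_modules
    if importlib.util.find_spec(name) is None
]

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```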

# 6 - Notebook version<a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.2</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>2 May 2025</td>
    </tr>
  </table>
</div>