# Morpheus Morphological Parser & JSON Export

## Table of content (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - The code</a>
    * <a href="#bullet2x1">2.1 - Read the morpheus outputfile</a>
    * <a href="#bullet2x2">2.2 - Parse each entry and gather morphological elements</a>
    * <a href="#bullet2x3">2.3 - Run on a File & Export JSON</a>
    * <a href="#bullet2x4">2.4 - Run on a File & Export JSON</a>

# 1 - Introduction  <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This notebook reads the full Morpheus output (plain‑text format), parses each word’s block,
extracts all morphological details, classifies part of speech (POS), converts Beta Code to
Unicode, and finally exports one comprehensive JSON file mapping every input word to all
its parses.

The logic in this notebook is based upon my deconstruction of the morpheus output as detailed in [this document](decode_output.md)

# 2 - The code  <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)


The key parts of the code in this notebook are:

 - Read the morpheus outputfile and split the raw output into entries (one per word).
 - Parse each entry and gather every morphological element (lemma, stem, ending, tense, voice, mood, person, case, gender, number, degree, dialect, variant, inflection class, etc.).  
 - Build a Python dict (`parsed_data`) mapping each input word to a list of parse dictionaries, then  dump it as nicely‑formatted JSON.

The code is fully modular and commented for maintainability and expantion.

## 2.1 - Read the morpheus outputfile

This is part 1; it reads the morpheus outputfile and splits the raw output into entries (one per word). 

In [14]:
import pathlib

input_file = 'gnt_morphology_results.txt'   #  path to raw Morpheus output
assert pathlib.Path(input_file).exists(), f"Input file not found: {input_file}"

word_blocks = []
current_block = []
with open(input_file, encoding='utf-8') as f:
    for line in f:
        line = line.rstrip('\n')
        if not line:
            continue
        if line.startswith('Word:'):
            if current_block:
                word_blocks.append(current_block)
            current_block = [line]
        elif line.startswith('----------------'):
            if current_block:
                word_blocks.append(current_block)
            current_block = []
        else:
            current_block.append(line)
    if current_block:
        word_blocks.append(current_block)

print(f'Read {len(word_blocks)} word entries from {input_file}')

Read 19446 word entries from gnt_morphology_results.txt


## 2.2 - Parse each block and gather its elements <a class="anchor" id="bullet2x2"></a>

This is part 2 and it is realy the meat to the bone; the engine. Here we parse each entry and gather every morphological element (lemma, stem, ending, tense, voice, mood, person, case, gender, number, degree, dialect, variant, inflection class, etc.). It also gathers specific details on POS, when pressent. The most tricky parts is the chained if-elif section, which will be commented heavily in the code below.

In [15]:
import re

def parse_word_block(block):
    header=block[0]
    word_beta=header.split('Word:',1)[1].strip()
    if len(block)>1 and block[1].startswith('Error:'):
        return word_beta,[]
    parses=[]; current=None
    for line in block[1:]:
        if not line.startswith(':'): continue
        label=line.split()[0][1:]
        fields=line[len(label)+2:].split('\t')
        if label=='raw':
            current={'raw_beta':fields[0].strip()}
            parses.append(current)
        elif current is None: continue

        
# ---------------------------------------------------------------
# :workw  ― “working word” (Morpheus-normalised token)
# ---------------------------------------------------------------
        
        elif label == "workw":
            # Column-0 holds the token that Morpheus actually analysed.
            # Typical differences from :raw:
            #   a leading asterisk (capital mark) may be removed
            #   accents might be regularised
            #   elision/apostrophe may be expanded when Morpheus is run with the –S switch
            work_token = fields[0].strip()           # e.g.  le/gete  vs  *le/gete
            current["work_beta"] = work_token        # keep Beta-Code form

            # Unicode copy for readability
            current["work_unicode"] = beta_code.beta_code_to_greek(work_token)

        
# ---------------------------------------------------------------
# :lem  ― handle lemma, homonym tag, and Unicode copies
# ---------------------------------------------------------------
        
        elif label == "lem":
            # col-0  = lemma in Beta-Code, possibly with numeric suffix
            #          e.g.  'le/gw1'  or  'h)/2'
            lemma_field = fields[0].strip()

            # Always keep the *full* lemma (including any numeric sense tag)
            current["lemma_beta"] = lemma_field

            # -----------------------------------------------------------
            # Detect homonymous lemmas   (pattern: <lemma><digits>)
            #   group(1) → base lemma   (le/gw)
            #   group(2) → digits       (1)
            # -----------------------------------------------------------
            m = re.match(r"^(.*?)(\d+)$", lemma_field)
            if m:
                current["lemma_base"] = m.group(1)          # le/gw
                current["homonym"]    = int(m.group(2))     # 1  (as int)
            else:
                # No numeric suffix → single-sense lemma
                current["lemma_base"] = lemma_field

            # -----------------------------------------------------------
            # Unicode copies for readability
            # -----------------------------------------------------------
            current["lemma_unicode"]       = beta_code.beta_code_to_greek(lemma_field)
            current["lemma_base_unicode"]  = beta_code.beta_code_to_greek(
                current["lemma_base"]
            )

# ------------------------------------------------------------------
# :prvb / :aug1 / :suff   ― 5-column layout
#   col-0  ⇒ preverb / augment / suffix  (betacode segment)
#   col-2  ⇒ dialect                    (may list several)
#   col-3  ⇒ morph-flags                (space-separated)
# ------------------------------------------------------------------
        
        elif label in {"prvb", "aug1", "suff"}:
            key_map = {"prvb": "preverb_beta",
                       "aug1": "augment_beta",
                       "suff": "suffix_beta"}
            key = key_map[label]
        
            # --- col-0: β-code segment (may be empty) ----------------------
            segment = fields[0].strip()
            if segment:
                # For :prvb you can get multiple prepositions separated by commas
                # (e.g. "dia/,kata/").  Store as list for convenience; otherwise keep str.
                current[key] = (
                    [s for s in re.split(r"[ ,]+", segment) if s] if label == "prvb" else segment
                )
        
            # --- col-2: dialect(s) ----------------------------------------
            if len(fields) > 2 and fields[2].strip():
                current.setdefault("dialects", []).extend(fields[2].strip().split())
        
            # --- col-3: morph-flags ---------------------------------------
            if len(fields) > 3 and fields[3].strip():
                current.setdefault("morph_flags", []).extend(fields[3].strip().split())

# ---------------------------------------------------------------
# :stem  ― 5-column layout
#   col-0  stem segment  (β-code)            → stem_beta / stem_unicode
#   col-1  inherent morpho-syntax            → gender / number / case …
#   col-2  dialect(s)                        → dialects  (list)
#   col-3  morph-flags                       → morph_flags  (list)
#   col-4  stem-type / paradigm code(s)      → morph_codes  (list)
# ---------------------------------------------------------------
        
        elif label == "stem":

            # ── col-0: stem segment (always first) ────────────────────
            stem_segment = fields[0].strip()
            if stem_segment:
                current["stem_beta"]    = stem_segment
                current["stem_unicode"] = beta_code.beta_code_to_greek(stem_segment)

            # ── col-1: inherent morpho-syntax (optional) ──────────────
            #     Examples: "fem", "masc sg", "mp", "nom/voc sg"
            if len(fields) > 1 and fields[1]:
                for tok in fields[1].split():
                    if tok in {"masc", "fem", "neut"}:
                        current["gender"] = tok
                    elif tok in {"sg", "pl", "dual"}:
                        current["number"] = tok
                    elif tok in {"nom", "acc", "gen", "dat", "voc"}:
                        current["case"] = tok
                    else:
                        # Catch-all: any uncommon token (e.g. "mp", "indeclform")
                        # is treated as an extra morph-flag.
                        current.setdefault("morph_flags", []).append(tok)

            # ── col-2: dialect markers (optional) ─────────────────────
            if len(fields) > 2 and fields[2].strip():
                current.setdefault("dialects", []).extend(
                    fields[2].strip().split()
                )

            # ── col-3: other morph-flags (optional) ───────────────────
            if len(fields) > 3 and fields[3].strip():
                current.setdefault("morph_flags", []).extend(
                    fields[3].strip().split()
                )

            # ── col-4: stem-type / paradigm codes (optional) ──────────
            #     E.g. "os_h_on", "aor2", "mi_pr"  (comma-separated list)
            if len(fields) > 4 and fields[4].strip():
                codes = [c.strip() for c in fields[4].split(",") if c.strip()]
                current.setdefault("morph_codes", []).extend(codes)

      
# ---------------------------------------------------------------
# :end  ― 5-column layout (most detailed line)
#   col-0  ending segment (β-code)                 → ending_beta / ending_unicode
#   col-1  full morphological feature string       → tense / mood / …  (parsed below)
#   col-2  dialect(s)  (optional)                  → dialects   (list)
#   col-3  morph-flags (optional)                  → morph_flags  (list)
#   col-4  paradigm / POS / extra codes (optional) → morph_codes  (list)
# ---------------------------------------------------------------
        
        elif label == "end":

            # ── col-0: ending segment ─────────────────────────────────
            if fields[0]:
                ending_seg = fields[0].strip()
                current["ending_beta"]    = ending_seg
                current["ending_unicode"] = beta_code.beta_code_to_greek(ending_seg)

            # ── col-1: morphological feature tokens ──────────────────
            if len(fields) > 1 and fields[1]:
                for tok in fields[1].split():
                    tl = tok.lower()            # normalised for matching

                    # ----------------- verbal tense ------------------
                    if tl in {"pres", "present"}:
                        current["tense"] = "present"
                    elif tl in {"imperf", "imperfect"}:
                        current["tense"] = "imperfect"
                    elif tl in {"fut", "future"}:
                        current["tense"] = "future"
                    elif tl in {"aor", "aorist"}:
                        current["tense"] = "aorist"
                    elif tl in {"perf", "perfect"}:
                        current["tense"] = "perfect"
                    elif tl in {"plup", "pluperfect"}:
                        current["tense"] = "pluperfect"

                    # ----------------- verbal mood -------------------
                    elif tl in {"ind", "indicative"}:
                        current["mood"] = "indicative"
                    elif tl in {"subj", "subjunctive"}:
                        current["mood"] = "subjunctive"
                    elif tl in {"opt", "optative"}:
                        current["mood"] = "optative"
                    elif tl in {"imperat", "imperative"}:
                        current["mood"] = "imperative"
                    elif tl in {"inf", "infinitive"}:
                        current["mood"] = "infinitive"
                    elif tl in {"part", "participle"}:
                        current["mood"] = "participle"

                    # ----------------- verbal voice ------------------
                    elif tl in {"act", "active"}:
                        current["voice"] = "active"
                    elif tl in {"mid", "middle"}:
                        current["voice"] = "middle"
                    elif tl in {"pass", "passive"}:
                        current["voice"] = "passive"
                    elif tl == "mp":
                        current["voice"] = "middle/passive"

                    # ---------------- person / number ----------------
                    elif tl in {"sg", "pl", "dual"}:
                        current["number"] = tl
                    elif re.match(r"^[123]$", tl):   # 1, 2, 3
                        current["person"] = tl

                    # ---------------- gender / case ------------------
                    elif tl in {"masc", "fem", "neut"}:
                        current["gender"] = tl
                    elif tl in {"nom", "acc", "gen", "dat", "voc"}:
                        # Handle combined cases like nom/voc/acc
                        if "/" in tok:
                            current.setdefault("case", []).extend(tok.split("/"))
                        else:
                            current["case"] = tl

                    # --------------- degree (comparatives) -----------
                    elif tl in {"comp", "comparative"}:
                        current["degree"] = "comparative"
                    elif tl in {"sup", "superlative"}:
                        current["degree"] = "superlative"

                    # --------------- fallback: keep unknowns ---------
                    else:
                        # Anything unrecognised is saved so nothing is lost
                        current.setdefault("other_end_tokens", []).append(tok)

            # ── col-2: dialects ──────────────────────────────────────
            if len(fields) > 2 and fields[2].strip():
                current.setdefault("dialects", []).extend(
                    fields[2].strip().split()
                )

            # ── col-3: morph-flags (e.g. contr, enclitic) ────────────
            if len(fields) > 3 and fields[3].strip():
                current.setdefault("morph_flags", []).extend(
                    fields[3].strip().split()
                )

            # ── col-4: paradigm / POS / extra codes ──────────────────
            if len(fields) > 4 and fields[4].strip():
                codes = [c.strip() for c in fields[4].split(",") if c.strip()]
                current.setdefault("morph_codes", []).extend(codes)
                
    return word_beta, parses

## 2.3 - Parse all blocks <a class="anchor" id="bullet2x3"></a>

Recursive call the previous function to process all blocks.

In [18]:
parsed_data={}
for blk in word_blocks:
    w,p=parse_word_block(blk)
    parsed_data[w]=p

In [19]:
# Just to check the output
from itertools import islice
sample = dict(islice(parsed_data.items(), 3))
for w, parses in sample.items():
    print(w, "->", parses[:2], "\n\n")   # prints only the first three parses for brevity

kai\ -> [{'raw_beta': 'kai\\', 'work_beta': 'kai/', 'work_unicode': 'καί', 'lemma_beta': 'kai/', 'lemma_base': 'kai/', 'lemma_unicode': 'καί', 'lemma_base_unicode': 'καί', 'stem_beta': 'kai/', 'stem_unicode': 'καί', 'morph_flags': ['indeclform', 'indeclform'], 'morph_codes': ['conj']}] 


o( -> [{'raw_beta': 'o(', 'work_beta': 'o(', 'work_unicode': 'ὁ', 'lemma_beta': 'o(', 'lemma_base': 'o(', 'lemma_unicode': 'ὁ', 'lemma_base_unicode': 'ὁ', 'stem_beta': 'o(', 'stem_unicode': 'ὁ', 'morph_flags': ['proclitic', 'indeclform', 'proclitic', 'indeclform'], 'gender': 'masc', 'case': 'nom', 'number': 'sg', 'morph_codes': ['article']}] 


e)n -> [{'raw_beta': 'e)n', 'work_beta': 'e)n', 'work_unicode': 'ἐν', 'lemma_beta': 'e)n', 'lemma_base': 'e)n', 'lemma_unicode': 'ἐν', 'lemma_base_unicode': 'ἐν', 'stem_beta': 'e)n', 'stem_unicode': 'ἐν', 'morph_flags': ['proclitic', 'indeclform', 'proclitic', 'indeclform'], 'morph_codes': ['prep']}, {'raw_beta': 'e)n', 'work_beta': 'e)n', 'work_unicode': 'ἐν',

## 2.4 - Unicode and POS <a class="anchor" id="bullet2x4"></a>

Perform the Unicode conversion and determing the POS.

In [24]:
import beta_code

# I have to realy make this uniform! T
pos_map={'conj':'conj','adv':'adverb','prep':'prep'}
pos_labels={'noun','verb','adjective','adverb','conj','prep',
            'pron','part','art','participle'}
for w,plist in parsed_data.items():
    for p in plist:
        
        # Convert Betacode to Unicode for key fields
        for k in [
            "raw_beta", "work_beta", "lemma_beta", "preverb_beta", "augment_beta",
            "stem_beta", "suffix_beta", "ending_beta"
        ]:
            if k not in p:
                continue
        
            value = p[k]
        
            if isinstance(value, list):
                # Convert every element in the list
                p[k.replace("_beta", "_unicode")] = [
                    beta_code.beta_code_to_greek(x) for x in value
                ]
            elif isinstance(value, str):
                # Single string → single string
                p[k.replace("_beta", "_unicode")] = beta_code.beta_code_to_greek(value)
        
        # determening the POS (still to be tweaked...)
        pos=None
        for code in p.get('morph_codes',[]):
            if code.lower() in pos_map: pos=pos_map[code.lower()]
            elif code.lower() in pos_labels: pos=code.lower()
            # order matters here????
            if code.lower()=='article': pos='art'
            if code.lower()=='particle': pos='part'
        if pos is None:
            if any(k in p for k in ('tense','mood','voice','person')):
                pos='participle' if p.get('mood')=='participle' or 'case' in p else 'verb'
            elif any(k in p for k in ('case','gender','number')):
                pos='article' if 'article' in p.get('morph_codes',[]) else 'noun'
            else:
                pos='part'
        p['POS']=pos


## Check the results

The following code gives a quick review & some statistics on the produced data.

In [25]:
import pandas as pd
from collections import Counter
from itertools import islice

# Flatten the parsed_data by converting to a list[dict]
flat_rows = []
for beta_word, parses in parsed_data.items():
    for p in parses:
        row = {
            "word_beta"   : beta_word,
            "word_unicode": beta_code.beta_code_to_greek(beta_word),
            "POS"         : p.get("POS")
        }
        row.update(p)          # pull all parse fields in
        flat_rows.append(row)

df = pd.DataFrame(flat_rows)

# Peek at the first few rows
print("\n=== First 5 parses ===")
print(df.head(5).to_string(index=False))

# Get some idea about POS distribution
print("\n=== POS distribution (value counts) ===")
print(df["POS"].value_counts(dropna=False))

# Get the most-frequent Case/Number/Gender combinations for  rows that actually have those three
mask = df[["case", "number", "gender"]].notna().all(axis=1)
cng_counts = Counter(
    tuple(df.loc[idx, ["case", "number", "gender"]])
    for idx in df[mask].index
)
print("\n=== Top 10 case/number/gender patterns ===")
for (case, num, gen), freq in islice(cng_counts.most_common(10), 10):
    print(f"{case:>4}/{num:<7}/{gen:<4}  → {freq:>6}")


=== First 5 parses ===
word_beta word_unicode  POS raw_beta work_beta work_unicode lemma_beta lemma_base lemma_unicode lemma_base_unicode stem_beta stem_unicode                                    morph_flags morph_codes raw_unicode gender case number        dialects other_end_tokens ending_beta ending_unicode  homonym tense mood voice augment_beta augment_unicode preverb_beta preverb_unicode degree
     kai\          καὶ conj     kai\      kai/          καί       kai/       kai/           καί                καί      kai/          καί                       [indeclform, indeclform]      [conj]         καὶ    NaN  NaN    NaN             NaN              NaN         NaN            NaN      NaN   NaN  NaN   NaN          NaN             NaN          NaN             NaN    NaN
       o(            ὁ  art       o(        o(            ὁ         o(         o(             ὁ                  ὁ        o(            ὁ [proclitic, indeclform, proclitic, indeclform]   [article]           ὁ   masc  n

## 2.5 - Export to JSON <a class="anchor" id="bullet2x5"></a>

In [26]:
import json

out_path='morpheus_parses.json'
with open(out_path,'w',encoding='utf-8') as jf:
    json.dump(parsed_data,jf,ensure_ascii=False,indent=2)
print(f'Wrote JSON to {out_path} ({len(parsed_data)} words).')

Wrote JSON to morpheus_parses.json (19446 words).


# 3 - Acknowledgements <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook used the following sources for the analysis and implementation:

- [Morpheus Morphological Analyzer (Perseus Project)](https://github.com/perseids-tools/morpheus/)
- [Greek Beta Code standard](https://stephanus.tlg.uci.edu/encoding/BCM.pdf)
- [beta-code-py](https://github.com/perseids-tools/beta-code-py)

The [Anaconda Asisstant](https://www.anaconda.com/capability/anaconda-assistant) (using [OpenAI](https://openai.com/) as backend) was used to debug and/or optimze the code in this Jupyter Notebook.

# 4 - Required libraries <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

The scripts in this notebook require Python 3.8+ and the following libraries to be installed in the environment:

``` python
    beta_code
    json
    pathlib
```

You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.

# 5 - Notebook version details<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.2</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>30 April 2025</td>
    </tr>
  </table>
</div>