# Morpheus Morphological Extractor \[OBSOLETE]

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - The code</a>
    * <a href="#bullet2x1">2.1 - Read the Morpheus output file</a>
    * <a href="#bullet2x2">2.2 - Parse each entry and gather morphological elements</a>
    * <a href="#bullet2x3">2.3 - Run on a File & Export JSON</a>
    * <a href="#bullet2x4">2.4 - Running the extraction</a>
* <a href="#bullet3">3 - Acknowledgements</a>
* <a href="#bullet4">4 - Required libraries</a>
* <a href="#bullet5">5 - Notebook version details</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This notebook extracts the morphological information from Morpheus output files and saves it
in a structured JSON format. The code does not attempt to guess or interpret the part of speech (POS).

Note:
> This notebook is now OBSOLETE because its functions are implemented much more cleanly in the package [morphkit](https://tonyjurg.github.io/morphkit/). Although based in part on the code in this notebook, the new package contains additional functionality and various corrections.

# 2 - The code <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

The key parts of the code in this notebook are:

 - Read the Morpheus output file and split the raw output into entries (one per word).
 - Parse each entry and gather every morphological element (lemma, stem, ending, tense, voice, mood, person, case, gender, number, degree, dialect, variant, inflection class, etc.).
 - Build a Python dict (named `data` in the code) mapping each input word to a list of parse dictionaries, then dump it as nicely formatted JSON.

The code is modular and heavily commented for maintainability and extension.
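
To make the target structure concrete, here is a minimal sketch of what one entry of the resulting dict could look like and how it serialises. The key names are the ones set by `parse_entry` in section 2.2; the word, the values and the variable name `example` are invented for illustration and are not real Morpheus output.

```python
import json

# Illustrative only: keys follow the parser in section 2.2, values are invented.
example = {
    "lu/w": [
        {
            "lemma_beta": "lu/w",
            "lemma_base": "lu/w",
            "stem": "lu",
            "ending": "w",
            "tense": "present",
            "mood": "indicative",
            "voice": "active",
            "person": 1,
            "number": "singular",
        }
    ]
}

# ensure_ascii=False keeps non-ASCII characters readable; indent=2 pretty-prints.
text = json.dumps(example, ensure_ascii=False, indent=2)
print(text)

# The JSON round-trips back to the same Python structure.
assert json.loads(text) == example
```

Each word maps to a list because one surface form can have several analyses; an empty list means no analysis was collected.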

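## 2.1 - Read the Morpheus output file <a class="anchor" id="bullet2x1"></a>

This first part reads the Morpheus output file and splits the raw output into entries (one per word). Lines starting with `-----` act as separators between entries.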

```python
import json
import re  # used by parse_entry() to split off numeric homonym suffixes
from pathlib import Path

import beta_code  # third-party: Beta Code <-> Unicode Greek conversion


def read_morpheus_file(file_path):
    """
    Read a Morpheus output text file and split it into entries (one per analysed word).

    Parameters
    ----------
    file_path : str or Path
        Path to the Morpheus output file.

    Returns
    -------
    list[list[str]]
        A list; each element is a list of lines (strings) belonging to one word entry.
    """
    entries = []
    current_entry = []
    with open(file_path, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip().startswith("-----"):   # separator line
                if current_entry:
                    entries.append(current_entry)
                current_entry = []
            else:
                current_entry.append(line.rstrip("\n"))
    # The last entry may not be followed by a separator
    if current_entry:
        entries.append(current_entry)
    return entries
```
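
A minimal sketch of the same splitting rule applied to an in-memory string, which makes the behaviour easy to check without a file. The helper `split_entries` and the sample text are hypothetical; the sample only imitates the markers the code looks for (`-----` separators, a `Word:` line) and is not real Morpheus output.

```python
def split_entries(text):
    """Split raw text into entries, using the same rule as read_morpheus_file."""
    entries, current = [], []
    for line in text.splitlines():
        if line.strip().startswith("-----"):   # a separator closes the current entry
            if current:
                entries.append(current)
            current = []
        else:
            current.append(line)
    if current:                                 # last entry has no trailing separator
        entries.append(current)
    return entries


# Invented sample; the line layout is an assumption, not real Morpheus output.
sample = "\n".join([
    "Word: lo/gos",
    ":raw lo/gos",
    "-----",
    "-----",            # consecutive separators do not create empty entries
    "Word: xyz",
    "Error: no analysis",
])

entries = split_entries(sample)
print(len(entries))      # 2
print(entries[0][0])     # Word: lo/gos
```

Because an entry is only stored when it contains at least one line, repeated or leading separators are harmless.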



## 2.2 - Parse each entry and gather morphological elements <a class="anchor" id="bullet2x2"></a>

This second part is the core of the notebook. It parses each entry and gathers every morphological element (lemma, stem, ending, tense, voice, mood, person, case, gender, number, degree, dialect, variant, inflection class, etc.).

```python
# Lookup tables for parse_entry(), defined once rather than rebuilt per token.
VALID_CASE = {"nom", "gen", "dat", "acc", "voc"}
VALID_GEND = {"masc", "fem", "neut"}
TENSE_MAP = {"pres": "present", "present": "present",
             "imperf": "imperfect", "imperfect": "imperfect",
             "fut": "future", "future": "future",
             "aor": "aorist", "aorist": "aorist",
             "perf": "perfect", "perfect": "perfect",
             "plup": "pluperfect", "pluperfect": "pluperfect"}
MOOD_MAP = {"ind": "indicative", "indicative": "indicative",
            "subj": "subjunctive", "subjunctive": "subjunctive",
            "opt": "optative", "optative": "optative",
            "imperat": "imperative", "imperative": "imperative",
            "inf": "infinitive", "infinitive": "infinitive",
            "part": "participle", "participle": "participle"}
VOICE_MAP = {"act": "active", "active": "active",
             "mid": "middle", "middle": "middle",
             "pass": "passive", "passive": "passive",
             "mid/pass": "middle/passive", "mp": "middle/passive",
             "act/pass": "active/passive"}
DIALECTS = {"attic", "doric", "ionic", "aeolic", "epic", "poetic"}
VARIANTS = {"contr", "enclitic", "proclitic", "irreg"}


def parse_entry(entry_lines):
    """
    Parse one word entry from Morpheus output and return (word, parses).

    Parameters:
        entry_lines : list[str]
            Lines belonging to a single entry.

    Returns:
        tuple[str, list[dict]] or None
            word_beta : The queried word in Beta Code.
            parses    : List of dicts, each containing morphological data
                        (empty if the entry contains an "Error:" line).
            None is returned if the entry does not start with "Word:".
    """
    if not entry_lines or not entry_lines[0].startswith("Word:"):
        return None

    word_beta = entry_lines[0].split(":", 1)[1].strip()
    for ln in entry_lines[1:]:
        if ln.strip().startswith("Error:"):
            return word_beta, []

    parses = []
    current = None

    for ln in entry_lines[1:]:
        if not ln.strip():
            continue

        # ---------------------------------------------------------------
        # :raw  - marks the start of a new analysis (the raw form itself
        #         is not stored)
        # ---------------------------------------------------------------
        if ln.startswith(":raw"):
            if current is not None:
                parses.append(current)
            current = {}
            continue

        # Ignore any field lines that appear before the first :raw line
        if current is None:
            continue

        # ---------------------------------------------------------------
        # :lem  - keep full lemma, split off numeric homonym, add Unicode
        # ---------------------------------------------------------------
        if ln.startswith(":lem"):
            lemma_field = ln.split(" ", 1)[1].strip()        # e.g.  le/gw1
            current["lemma_beta"] = lemma_field              # full form with suffix

            # Separate base-lemma and numeric homonym tag (if any)
            m = re.match(r"^(.*?)(\d+)$", lemma_field)
            if m:
                current["lemma_base"] = m.group(1)           # le/gw
                current["homonym"]    = int(m.group(2))      # 1
            else:
                current["lemma_base"] = lemma_field          # no numeric suffix

            # Unicode copies
            current["lemma_unicode"]       = beta_code.beta_code_to_greek(lemma_field)
            current["lemma_base_unicode"]  = beta_code.beta_code_to_greek(current["lemma_base"])

        elif ln.startswith(":prvb"):
            pv = ln.partition(" ")[2].strip()
            if pv:
                current["prefix"] = pv

        elif ln.startswith(":aug1"):
            aug = ln.partition(" ")[2].strip()
            if aug:
                current["augment"] = aug

        elif ln.startswith(":stem"):
            content = ln.partition(" ")[2].strip()
            if content:
                parts = content.split()
                current["stem"] = parts[0]
                if len(parts) > 1:
                    # The last token holds the inflection class code(s), if any
                    cls = parts[-1]
                    if "_" in cls or "," in cls:
                        codes = [c.strip() for c in cls.split(",") if c.strip()]
                        current["inflection_class"] = codes if len(codes) > 1 else codes[0]

        elif ln.startswith(":suff"):
            suff = ln.partition(" ")[2].strip()
            if suff:
                current["suffix"] = suff

        elif ln.startswith(":end"):
            content = ln.partition(" ")[2].strip()
            if not content:
                continue

            toks = content.split()
            current["ending"] = toks[0]
            morph_toks = toks[1:]

            # Split off trailing pattern codes (contain _ or ,)
            pattern = []
            for i, tok in enumerate(morph_toks):
                if "_" in tok or "," in tok:
                    pattern = morph_toks[i:]
                    morph_toks = morph_toks[:i]
                    break
            if pattern:
                codes = [c for tok in pattern for c in tok.split(",") if c]
                if codes:
                    # Merge with codes already found on the :stem line, without duplicates
                    if "inflection_class" in current:
                        exist = current["inflection_class"]
                        if not isinstance(exist, list):
                            exist = [exist]
                        for c in codes:
                            if c not in exist:
                                exist.append(c)
                        current["inflection_class"] = exist if len(exist) > 1 else exist[0]
                    else:
                        current["inflection_class"] = codes if len(codes) > 1 else codes[0]

            # Classify the remaining tokens into morphological categories
            cat = {}
            for tok in morph_toks:
                t = tok.lower()

                # gender (slash-separated alternatives become a list)
                if all(p in VALID_GEND for p in t.split("/")):
                    g = t.split("/")
                    cat["gender"] = g if len(g) > 1 else g[0]
                    continue
                # case (slash-separated alternatives become a list)
                if all(p in VALID_CASE for p in t.split("/")):
                    c = t.split("/")
                    cat["case"] = c if len(c) > 1 else c[0]
                    continue
                # number
                if t in {"sg", "singular", "pl", "plural"}:
                    cat["number"] = "singular" if t.startswith("s") else "plural"
                    continue
                # tense
                if t in TENSE_MAP:
                    cat["tense"] = TENSE_MAP[t]
                    continue
                # mood
                if t in MOOD_MAP:
                    cat["mood"] = MOOD_MAP[t]
                    continue
                # voice
                if t in VOICE_MAP:
                    cat["voice"] = VOICE_MAP[t]
                    continue
                # person
                if t in {"1st", "2nd", "3rd"}:
                    cat["person"] = int(t[0])
                    continue
                # degree
                if t in {"comp", "comparative"}:
                    cat["degree"] = "comparative"
                    continue
                if t in {"superl", "superlative"}:
                    cat["degree"] = "superlative"
                    continue
                # dialect or variant; anything unrecognised is kept under "other"
                if t in DIALECTS:
                    cat.setdefault("dialect", []).append(t)
                elif t in VARIANTS:
                    cat.setdefault("variant", []).append(t)
                else:
                    cat.setdefault("other", []).append(tok)

            current.update(cat)

    if current is not None:
        parses.append(current)
    return word_beta, parses
```
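
Two steps of the parser are easy to try in isolation: splitting a numeric homonym suffix off a lemma, and separating the ending from its morphological tokens and trailing pattern codes. The sketch below repeats that logic in two hypothetical helpers (`split_lemma`, `split_end_line`). The `:end` sample line is an assumed layout chosen to exercise the code, not verified Morpheus output.

```python
import re


def split_lemma(lemma_field):
    """Return (base, homonym); homonym is an int, or None if there is no suffix."""
    m = re.match(r"^(.*?)(\d+)$", lemma_field)
    if m:
        return m.group(1), int(m.group(2))
    return lemma_field, None


def split_end_line(line):
    """Return (ending, morph_tokens, pattern_codes) for an ':end' line."""
    toks = line.partition(" ")[2].split()
    if not toks:
        return None, [], []
    ending, morph = toks[0], toks[1:]
    codes = []
    for i, tok in enumerate(morph):
        if "_" in tok or "," in tok:            # first pattern-like token
            codes = [c for t in morph[i:] for c in t.split(",") if c]
            morph = morph[:i]
            break
    return ending, morph, codes


print(split_lemma("le/gw1"))     # ('le/gw', 1)
print(split_lemma("lo/gos"))     # ('lo/gos', None)

# Assumed sample line: ending, morphological tokens, then pattern codes.
print(split_end_line(":end w pres ind act 1st sg w_stem,reg_conj"))
```

One consequence of the regular expression is that any lemma ending in digits is treated as carrying a homonym number.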

## 2.3 - Run on a File & Export JSON <a class="anchor" id="bullet2x3"></a>

This short function builds a Python dict mapping each input word to its list of parse dictionaries and returns it. If an output path is given, the dict is also written to that file as nicely formatted JSON.


```python
def morpheus_to_json(input_path: str | Path, output_path: str | Path | None = None):
    """Parse *input_path*, optionally save JSON to *output_path*, and return the dict."""
    entries = read_morpheus_file(input_path)
    data = {}
    for ent in entries:
        res = parse_entry(ent)
        if res:                      # skip entries that do not start with "Word:"
            word, parses = res
            data[word] = parses      # a repeated word overwrites its earlier entry

    if output_path:
        with open(output_path, "w", encoding="utf-8") as fh:
            json.dump(data, fh, ensure_ascii=False, indent=2)
    return data
```

## 2.4 - Running the extraction <a class="anchor" id="bullet2x4"></a>

The following code runs the functions defined in the previous sections. It first checks that the input file exists; after parsing, it prints the first three items as a quick sanity check.

```python
from pathlib import Path
import json

input_file = Path("gnt_morphology_results.txt")
output_file = Path("gnt_morphology_results.json")

if input_file.exists():
    print(f"Parsing {input_file} ...")
    parsed = morpheus_to_json(input_file, output_file)
    print(f"Saved JSON to {output_file}\n")

    # Print the first three items for a quick sanity check
    first_three = list(parsed.items())[:3]
    print("First three items in parsed JSON:")
    for word, parses in first_three:
        print(json.dumps({word: parses}, ensure_ascii=False, indent=2))
else:
    print(
        f"Input file '{input_file}' not found."
    )
```

```text
Parsing gnt_morphology_results.txt ...
Saved JSON to gnt_morphology_results.json

First three items in parsed JSON:
{
  "agnwstw": []
}
{
  "anaqema": []
}
{
  "ai)gupti/wn": []
}
```
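
All three sample words above come back with an empty list. In `parse_entry`, that happens when the entry contains an `Error:` line or has no `:raw` line. A small helper can report how many words are affected. This is only a sketch: `unparsed_words` is a hypothetical name, and it is shown on an in-memory dict (the `lo/gos` entry is invented) that has the same shape as the dict returned by `morpheus_to_json`.

```python
def unparsed_words(parsed):
    """Return the words whose list of parses is empty."""
    return [word for word, parses in parsed.items() if not parses]


# Stand-in for the dict returned by morpheus_to_json().
parsed = {
    "agnwstw": [],
    "anaqema": [],
    "lo/gos": [{"lemma_beta": "lo/gos"}],   # invented entry with one parse
}

missing = unparsed_words(parsed)
print(f"{len(missing)} of {len(parsed)} words have no parse: {missing}")
```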


# 3 - Acknowledgements <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook used the following sources for the analysis and implementation:

- [Morpheus Morphological Analyzer (Perseus Project)](https://github.com/perseids-tools/morpheus/)
- [Greek Beta Code standard](https://stephanus.tlg.uci.edu/encoding/BCM.pdf)
- [beta-code-py](https://github.com/perseids-tools/beta-code-py)

# 4 - Required libraries <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

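The code in this notebook requires Python 3.10+ (the `str | Path` type hints in section 2.3 are evaluated when the function is defined and fail on older versions) and the following libraries:

``` python
    beta_code
    json
    pathlib
    re
```

`json`, `pathlib` and `re` are part of the Python standard library. Only `beta_code` has to be installed separately; you can do so from within Jupyter Notebook using either `pip` or `pip3` (the package is published on PyPI as `beta-code`).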
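
A quick way to see whether anything is missing before running the notebook is to ask the import system. This is a minimal sketch; `missing_modules` is a hypothetical helper and not part of the notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["beta_code", "json", "pathlib", "re"]
print(missing_modules(required) or "All required modules are available.")
```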

# 5 - Notebook version details<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.2</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>30 April 2025</td>
    </tr>
  </table>
</div>