# Morpheus Morphological Parser (version of 16 May)

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
    * <a href="#bullet1x1">1.1 - Production environment</a>
    * <a href="#bullet1x2">1.2 - Test environment</a>
* <a href="#bullet2">2 - Setting up</a>
    * <a href="#bullet2x1">2.1 - Determine the main classes of Part of Speech</a>
    * <a href="#bullet2x2">2.2 - Determine the SP morphological tag</a>
    * <a href="#bullet2x3">2.3 - Gather details from Morpheus blocks</a>
    * <a href="#bullet2x4">2.4 - Analyzing a single word</a>
* <a href="#bullet3">3 - Production</a>
    * <a href="#bullet3x1">3.1 - Load the input words</a>
    * <a href="#bullet3x2">3.2 - Running the production</a>
* <a href="#bullet4">4 - Validation</a>
    * <a href="#bullet4x1">4.1 - Load the test files</a>
    * <a href="#bullet4x2">4.2 - Run the test files</a>
* <a href="#bullet5">5 - Attribution and footnotes</a>
* <a href="#bullet6">6 - Required libraries</a>
* <a href="#bullet7">7 - Notebook version</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

The parsing process is distributed over several cells in this notebook. The actual analysis starts in section [bla], but all preceding cells must be executed first. In cell [bla], the analysis begins with a simple Betacode string (the Greek word rendered in ASCII), which was read from a plain text file (in step []).

Each word is then passed to the `analyze_word_with_morpheus` function. That function encodes the word into a URL-safe form and sends an HTTP request to the local Morpheus server running in its virtualized container, from which it retrieves the raw, line-oriented output. As soon as the response arrives, it is split at every line that begins with `:raw`, producing discrete “blocks” of analysis, each of which corresponds to one of the parses produced by Morpheus.
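
The block-splitting step can be sketched in a few lines. This is only an illustration and not the morphkit source: the helper name `split_into_blocks` is an assumption, and the sample text is a shortened version of the raw output shown in section 2.1.

```python
def split_into_blocks(raw_text):
    """Group raw Morpheus output into blocks; a new block starts at each ':raw' line.

    Hypothetical helper for illustration; morphkit may implement this differently.
    """
    blocks, current = [], None
    for line in raw_text.splitlines():
        if line.startswith(":raw"):
            current = [line]          # open a new block
            blocks.append(current)
        elif current is not None and line.strip():
            current.append(line)      # keep non-empty lines of the open block
    return blocks

# Shortened sample modelled on the output for 'e)/sxaton'
sample = (
    "\n:raw e)/sxaton\n\n:workw e)/sxaton\n:lem e)/sxatos\n"
    ":end on\t masc acc sg\t\t\tos_h_on\n"
    "\n:raw e)/sxaton\n\n:workw e)/sxaton\n:lem e)/sxatos\n"
    ":end on\t neut nom/voc/acc sg\t\t\tos_h_on\n"
)

blocks = split_into_blocks(sample)
print(len(blocks))
```

Each block is returned as a list of lines, which matches the list-of-lists shape that `format_string` in section 4.2 expects for `entry["blocks"]`.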

Each block is then fed into `parse_word_block`, which walks through all of the output lines looking for labels like `:lem`, `:stem`, and `:end`. From these it assembles a rich dictionary of morphological features for every possible parse. Once those raw features are in place, two post-processing routines are called. First, `analyze_pos` examines the feature dictionary to assign a part-of-speech label. For this it checks for keywords, indeclinable forms (turning neuter-singular indeclinables into adverbs and the rest into particles), verbal markers like tense or mood, and a host of other morph-code clues. Next, `analyze_morph_tag` constructs the standardized Robinson/Pierpont tag. It builds upon the previous POS analysis and adds to the prefix items like tense, voice, mood, person and number (for verbs) or case/number/gender/degree (for nouns and adjectives). The function then returns a very compact code like `V-PAP-DSM` or `A-NSM`.
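
As a rough illustration of the label-walking step, the sketch below turns the lines of one block into a dictionary keyed by label. The function name `parse_block_lines` and the returned structure are assumptions made for this example; the real `parse_word_block` produces a richer dictionary.

```python
def parse_block_lines(lines):
    """Map each ':label' line to its tab-separated fields (illustrative only)."""
    features = {}
    for line in lines:
        if not line.startswith(":"):
            continue
        # ':end on\t masc acc sg' -> label 'end', rest 'on\t masc acc sg'
        label, _, rest = line[1:].partition(" ")
        features[label] = [field.strip() for field in rest.split("\t")]
    return features

# One block, copied from the raw output for 'e)/sxaton' in section 2.1
block = [
    ":raw e)/sxaton",
    ":workw e)/sxaton",
    ":lem e)/sxatos",
    ":stem e)sxat\t\t\t\tos_h_on",
    ":end on\t masc acc sg\t\t\tos_h_on",
]

features = parse_block_lines(block)
print(features["lem"])
print(features["end"])
```

In this sketch the second field of `:end` carries the morphological description (`masc acc sg`), which is the kind of information the POS and tag routines work from.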

## 1.1 - Production <a class="anchor" id="bullet1x1"></a>

Production!

## 1.2 - Test <a class="anchor" id="bullet1x2"></a>

Besides the production section, there is also a test section. In this section I load a tab-separated file of word-and-expected-tag pairs. This is used by `funct` to call `analyze_word_with_morpheus` for each word found in the test set. From the results of this function, it gathers the multi-tag strings (SP tags joined by “/”). Each string is then flattened into a simple deduplicated list by the function `flatten_tags`. This makes it easy to check whether the expected tag (taken from N1904-TF) appears in that list. A counter records the number of matches and non-matches. Since there is a significant difference in word classification between SP and Morpheus, a 'no-match' does not always indicate a failure. A small HTML-generation routine stitches together the raw Morpheus blocks (formatted by `format_string`) and the escaped JSON of the parsed dictionaries. Colours are added to visualize the match/no-match results. The final product of the test is a standalone HTML report that shows, for every input word, exactly what Morpheus returned, how it was parsed and tagged, and whether it matched the SP tag found in N1904-TF.
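
The flatten-and-check step can be tried in isolation. The sketch below uses a compact variant of the `flatten_tags` helper defined in section 4.2; the parse dictionaries are made-up stand-ins for what `analyze_word_with_morpheus` would return, and the `counts` dictionary is a hypothetical replacement for the separate counters used in the notebook.

```python
def flatten_tags(analyses):
    """Return a sorted, deduplicated list of SP tags, splitting multi-tags on '/'."""
    flat = set()
    for parse in analyses:
        for part in (parse.get("sp_morph_tag") or "").split("/"):
            if part.strip():
                flat.add(part.strip())
    return sorted(flat)

# Illustrative parses only; real ones come from analyze_word_with_morpheus.
analyses = [
    {"sp_morph_tag": "V-PAP-DSM/V-PAP-DSN"},
    {"sp_morph_tag": "V-PAP-DSM"},
    {"sp_morph_tag": None},
]

actual = flatten_tags(analyses)
expected = "V-PAP-DSN"

# Tally the outcome for this one word
counts = {"match": 0, "no-match": 0}
counts["match" if expected in actual else "no-match"] += 1
print(actual, counts)
```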


# 2 - Setting up <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

## 2.1 - Load the morphkit library <a class="anchor" id="bullet2x1"></a>

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.insert(0, "../../morphkit")    # relative to notebook dir
import morphkit

morphkit loaded


In [3]:
# 1. Analyze a word with Morpheus
result=morphkit.analyze_word_with_morpheus('th\\n', 'http://10.0.1.156:1315/greek/',debug=False)
# result['analyses'] is a list of parse dicts

In [4]:
# 2. Inspect part of speech and SP tag for each analysis
for parse in result['analyses']:
    pos_label = morphkit.analyze_pos(parse)
    print (f'pos_label={pos_label}')
    sp_tag    = morphkit.analyze_morph_tag(parse)
    print(f'sp_tag={sp_tag}')  # NOTE: sp_tag may be a compound of several tags joined by "/"
    decoded   = morphkit.decode_tag(sp_tag)
    print(f'decoded={decoded}')

pos_label=article
sp_tag=T-ASF
decoded={'Part of Speech': 'Article', 'Case': 'Accusative', 'Number': 'Singular', 'Gender': 'Feminine'}


In [5]:
# 3. Compare two SP tags
score = morphkit.compare_tags('N-NSN-ATT', 'N-ASM-ATT', debug=True)
print (f'\nscore={score}')
print(f"\nSimilarity: {score['overall_similarity']}")

[compare_tags] First tag: N-NSN-ATT;  second tag: N-ASM-ATT
[compare_tags] Part of Speech   : Noun         vs Noun         → sim=1.00, weight=10
[compare_tags] Number           : Singular     vs Singular     → sim=1.00, weight=3
[compare_tags] Tense            :              vs              → sim=1.00, weight=8
[compare_tags] Voice            :              vs              → sim=1.00, weight=6
[compare_tags] Mood             :              vs              → sim=1.00, weight=6
[compare_tags] Gender           : Neuter       vs Masculine    → sim=0.20, weight=2
[compare_tags] Case             : Nominative   vs Accusative   → sim=0.20, weight=4
[compare_tags] Person           :              vs              → sim=1.00, weight=3
[compare_tags] Suffix           : Attic        vs Attic        → sim=1.00, weight=1
[compare_tags]  Overall similarity: 0.888

score={'tag1': 'N-NSN-ATT', 'tag2': 'N-ASM-ATT', 'overall_similarity': 0.8883720930232557, 'details': {'Part of Speech': {'tag1': 'Noun', 't

In [6]:
morphkit.compare_tags("N-NSN-ATT", "N-ASM-ATT")

{'tag1': 'N-NSN-ATT',
 'tag2': 'N-ASM-ATT',
 'overall_similarity': 0.8883720930232557,
 'details': {'Part of Speech': {'tag1': 'Noun',
   'tag2': 'Noun',
   'similarity': 1.0,
   'weight': 10},
  'Number': {'tag1': 'Singular',
   'tag2': 'Singular',
   'similarity': 1.0,
   'weight': 3},
  'Tense': {'tag1': '', 'tag2': '', 'similarity': 1.0, 'weight': 8},
  'Voice': {'tag1': '', 'tag2': '', 'similarity': 1.0, 'weight': 6},
  'Mood': {'tag1': '', 'tag2': '', 'similarity': 1.0, 'weight': 6},
  'Gender': {'tag1': 'Neuter',
   'tag2': 'Masculine',
   'similarity': 0.2,
   'weight': 2},
  'Case': {'tag1': 'Nominative',
   'tag2': 'Accusative',
   'similarity': 0.2,
   'weight': 4},
  'Person': {'tag1': '', 'tag2': '', 'similarity': 1.0, 'weight': 3},
  'Suffix': {'tag1': 'Attic',
   'tag2': 'Attic',
   'similarity': 1.0,
   'weight': 1}}}
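
The overall similarity in this output is consistent with a weighted mean of the per-feature similarities. The sketch below recomputes it from the similarity and weight values printed above; it is a reading of the output, not a statement about how `compare_tags` is implemented internally.

```python
# (similarity, weight) per feature, copied from the compare_tags output above
details = {
    "Part of Speech": (1.0, 10),
    "Number": (1.0, 3),
    "Tense": (1.0, 8),
    "Voice": (1.0, 6),
    "Mood": (1.0, 6),
    "Gender": (0.2, 2),
    "Case": (0.2, 4),
    "Person": (1.0, 3),
    "Suffix": (1.0, 1),
}

total_weight = sum(w for _, w in details.values())      # sum of all weights
total_raw = sum(s * w for s, w in details.values())     # weighted contributions
overall = total_raw / total_weight                      # normalized similarity

print(total_weight, round(total_raw, 2), round(overall, 3))
```

The weights add up to 43 and the weighted contributions to 38.2, so the quotient is about 0.888, in agreement with the reported `overall_similarity`. The report code in section 4.2 performs the same normalization when it builds the tooltip tables.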

In [7]:
morphkit.decode_tag("V-IAI-1S-CF",debug=True)

[decode_tag] Return ({'Part of Speech': 'Verb', 'Tense': 'Imperfect', 'Voice': 'Active', 'Mood': 'Indicative', 'Person': 'First Person', 'Number': 'Singular', 'Suffix': 'ERROR: Unknown suffix -CF'})


{'Part of Speech': 'Verb',
 'Tense': 'Imperfect',
 'Voice': 'Active',
 'Mood': 'Indicative',
 'Person': 'First Person',
 'Number': 'Singular',
 'Suffix': 'ERROR: Unknown suffix -CF'}

In [8]:
morphkit.decode_tag("N-ASF-C",debug=True)

[decode_tag] Return ({'Part of Speech': 'Noun', 'Case': 'Accusative', 'Number': 'Singular', 'Gender': 'Feminine', 'Suffix': 'Comparative'})


{'Part of Speech': 'Noun',
 'Case': 'Accusative',
 'Number': 'Singular',
 'Gender': 'Feminine',
 'Suffix': 'Comparative'}
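
The positional layout of a simple nominal tag can be illustrated with a small lookup-based decoder. This is a minimal sketch under stated assumptions: `decode_nominal_tag` and the partial tables below are hypothetical and cover only plain `<POS>-<case><number><gender>` tags, whereas `morphkit.decode_tag` also handles verbs and suffixes.

```python
# Partial lookup tables (assumption: only the codes needed for this example)
POS = {"N": "Noun", "A": "Adjective", "T": "Article"}
CASE = {"N": "Nominative", "G": "Genitive", "D": "Dative",
        "A": "Accusative", "V": "Vocative"}
NUMBER = {"S": "Singular", "P": "Plural"}
GENDER = {"M": "Masculine", "F": "Feminine", "N": "Neuter"}

def decode_nominal_tag(tag):
    """Decode the first two segments of a nominal tag; any suffix is ignored."""
    pos_code, cng = tag.split("-")[:2]
    case_code, number_code, gender_code = cng   # e.g. 'ASF' -> 'A', 'S', 'F'
    return {
        "Part of Speech": POS[pos_code],
        "Case": CASE[case_code],
        "Number": NUMBER[number_code],
        "Gender": GENDER[gender_code],
    }

print(decode_nominal_tag("T-ASF"))
```

For `T-ASF` this yields the same four fields that `decode_tag` returned for the article earlier in this section.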

In [9]:
baseUrl="http://10.0.1.156:1315/greek/" # (IP:port of the Docker instance)
morphkit.analyze_word_with_morpheus('e)/sxaton',baseUrl)

{'word': 'e)/sxaton',
 'raw_text': '\n:raw e)/sxaton\n\n:workw e)/sxaton\n:lem e)/sxatos\n:prvb \t\t\t\t\n:aug1 \t\t\t\t\n:stem e)sxat\t\t\t\tos_h_on\n:suff \t\t\t\t\n:end on\t masc acc sg\t\t\tos_h_on\n\n:raw e)/sxaton\n\n:workw e)/sxaton\n:lem e)/sxatos\n:prvb \t\t\t\t\n:aug1 \t\t\t\t\n:stem e)sxat\t\t\t\tos_h_on\n:suff \t\t\t\t\n:end on\t neut nom/voc/acc sg\t\t\tos_h_on\n\n:raw e)/sxaton\n\n:workw e)sxa=ton\n:lem ei)s-xa/w\n:prvb ei)s\t\t\tshort_eis\t\n:aug1 \t\t\t\t\n:stem x\t\t\t\taw_pr,aw_denom\n:suff \t\t\t\t\n:end a=ton\t pres imperat act 2nd dual\t\tcontr\taw_pr\n\n:raw e)/sxaton\n\n:workw e)sxa=ton\n:lem ei)s-xa/w\n:prvb ei)s\t\t\tshort_eis\t\n:aug1 \t\t\t\t\n:stem x\t\t\t\taw_pr,aw_denom\n:suff \t\t\t\t\n:end a=ton\t pres subj act 3rd dual\t\tcontr\taw_pr\n\n:raw e)/sxaton\n\n:workw e)sxa=ton\n:lem ei)s-xa/w\n:prvb ei)s\t\t\tshort_eis\t\n:aug1 \t\t\t\t\n:stem x\t\t\t\taw_pr,aw_denom\n:suff \t\t\t\t\n:end a=ton\t pres subj act 2nd dual\t\tcontr\taw_pr\n\n:raw e)/sxaton\n\n:w

# 3 - Production <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

## 3.1 - Load the input words <a class="anchor" id="bullet3x1"></a>

In [8]:
inputFile = "testset.txt"
with open(inputFile, "r", encoding="utf-8") as f:
    greekWords = [line.strip() for line in f if line.strip()]
print(f"Loaded {len(greekWords)} words.")

Loaded 38 words.


## 3.2 - Running the production <a class="anchor" id="bullet3x2"></a>

In [10]:
"""
Main Morpheus Analyzer

This script takes a list of Greek words in Betacode (loaded before to greekWords),
queries the local Morpheus service for each, and collects all parsed
morphological analyses into a single JSON file.

Workflow:
  1. Iterate over each word in `greekWords`, showing progress via tqdm.
  2. For each word, call `analyze_word_with_morpheus` to fetch and parse
     Morpheus’s output into a structured entry.
  3. Extract the `"analyses"` list from each entry and append it to `results`.
  4. After all words have been processed, serialize `results` to
     `morpheus_results.json` with pretty indentation.

Result:
  A JSON array where each element is the list of parse dictionaries
  (one per candidate analysis) returned by Morpheus for one input word.
"""

import requests
import re
import json
from tqdm.std import tqdm

baseUrl = "http://10.0.1.156:1315/greek/"
results = []

for word in tqdm(greekWords, desc="Processing words"):
    try:
        entry = morphkit.analyze_word_with_morpheus(word, baseUrl)
        results.append(entry["analyses"])
    except Exception as e:
        print(f"Error for {word!r}: {e}")

# write out to JSON
with open("morpheus_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

NameError: name 'greekWords' is not defined

# 4 - Validation <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

## 4.1 - Load the test file <a class="anchor" id="bullet4x1"></a>

In [8]:
inputFile = "testset.txt"
with open(inputFile, "r", encoding="utf-8") as f:
    testWords = [line.strip() for line in f if line.strip()]
print(f"Loaded {len(testWords)} words.")

Loaded 38 words.


## 4.2 - Run the test file <a class="anchor" id="bullet4x2"></a>

In [13]:
"""
Comparing Morpheus and SP-tags

Reads a tab-separated file with Betacode words and expected SP-tags, then
calls Morpheus via `analyze_word_with_morpheus` to fetch and parse each form.
Next it flattens multi-tags into individual codes and checks for matches.
The script generates a color-coded HTML report showing, for each entry:
  • the input and expected tag with match/no-match indication
  • a collapsible raw block from Morpheus and a collapsible JSON dump
The result is written to the file `SP-tag-analysis.htm` (to be opened in a browser),
and a brief summary is displayed in the notebook.
"""

debug=False

import csv
import re
import requests
import json
import html
import beta_code
from IPython.display import display, HTML
import traceback
from tqdm.std import tqdm

baseUrl = "http://10.0.1.156:1315/greek/"

# path to your failure log
fail_log_path = "failures.txt"
# ensure it exists (or clear it) before the run
open(fail_log_path, "w", encoding="utf-8").close()

def format_string(s):
    if not s:
        return "<pre>empty</pre>"

    lines = []
    for sublist in s:
        line = ', '.join(sublist)
        lines.append(line)

    formatted_lines = []

    for line in lines:
        items = line.split(', ')
        formatted_items = []

        for item in items:
            item = re.sub(r'\t', '    ', item)
            item = re.sub(r':raw', '\n:raw', item)
            item = item.strip("'")  # remove quotes
            formatted_items.append(item)

        formatted_line = '\n'.join(formatted_items)
        formatted_lines.append(formatted_line)

    if not formatted_lines:
        return "<pre></pre>"

    formatted_string = '\n'.join(formatted_lines)
    html_string = f"<pre>\n{formatted_string}\n</pre>"
    return html_string


def flatten_tags(analyses):
    """
    Given a list of parse‐dicts, return a sorted list of
    individual SP‐tags (splitting any "V-PAP-DSM/V-PAP-DSN" etc.).
    """
    flat = set()
    for p in analyses:
        tag = p.get("sp_morph_tag") or ""
        # skip if there's no tag
        if not tag.strip():
            continue
        # split on "/" and add each non-empty part
        for part in tag.split("/"):
            part = part.strip()
            if part:
                flat.add(part)
    return sorted(flat)
    

tests = []
# --- load testset --------------------------------------
# inputfile = 'problems.txt' # the words where no or no close match was found during last run
# inputfile = 'testset.txt' # sample testset (small with a few different types of tags)
inputfile = 'beta_morph_pairs.txt' # large set (one word for each morph tags found in the GNT)
with open(inputfile, 'r', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if not row or row[0].startswith('#'):
            continue
        tests.append({
            'word_betacode':  row[0].strip(),
            'expected_sp_tag': row[1].strip()
        })

total, passed, close_count, no_block_count, failed = len(tests), 0, 0, 0, 0
HTMLobject = ""
close_threshold = 0.8  # adjustable threshold for “close match”

# --- run tests ----------------------------------------

# Start building the HTML, including our tooltip‐CSS in the <head>
HTMLtotal = '''<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>SP-Tag Test Results</title>
  <style>
    body { font-family: sans-serif; padding: 20px; }

    /* Tooltip container */
    .tooltip {
      position: relative;
      display: inline-block;
    }
    /* Hidden content */
    .tooltip .tooltip-content {
      visibility: hidden;
      background: white;
      color: black;
      text-align: left;
      border: 1px solid #ccc;
      padding: 6px;
      position: absolute;
      z-index: 100;
      top: 100%;
      left: 50%;
      transform: translateX(-20%);
      white-space: nowrap;
    }
    /* Show on hover */
    .tooltip:hover .tooltip-content {
      visibility: visible;
    }
    /* Table styling */
    .tooltip-content table {
      border-collapse: collapse;
      font-family: monospace;
      font-size: 12px;
    }
    .tooltip-content th,
    .tooltip-content td {
      border: 1px solid #ddd;
      padding: 2px 6px;
    }
    .tooltip-content th {
      background: #f0f0f0;
    }

    .no-block { opacity: 0.6; }
  </style>
  <script>
// toggle all elements of class `cat`
function toggleFilter(cat) {
  const elems = document.getElementsByClassName(cat);
  for (let e of elems) {
    e.style.display = (e.style.display === 'none') ? 'block' : 'none';
  }
}
</script>
</head><body>
'''

for item in tqdm(tests, desc="Processing words"):
#for item in tests:
    wb       = item['word_betacode']
    expected = item['expected_sp_tag']
    uni      = beta_code.beta_code_to_greek(wb)

    try:
        entry = morphkit.analyze_word_with_morpheus(wb, baseUrl, debug=False)
    except Exception as e:
        print(f'Failing word: {wb} with error: {e}')
        failed += 1
        if debug:
            # Print the full stack trace
            tb_str = traceback.format_exc()
            print(f'Trace: {tb_str}')
        continue

    # — check for no Morpheus blocks —
    has_blocks = bool(entry.get("blocks"))
    no_block   = not has_blocks  # mark this entry so we can style it later

    actual_list = flatten_tags(entry["analyses"])
    ok          = expected in actual_list

    # Compare each generated tag
    sim_results = [morphkit.compare_tags(expected, tag, debug=False) for tag in actual_list]
    sim_results.sort(key=lambda r: r['overall_similarity'], reverse=True)

    # Determine main category/color
    if not has_blocks:
        category, color = "no-block", "gray" ; no_block_count+=1
    elif ok:
        category, color = "match", "green"; passed += 1
    elif any(r['overall_similarity'] >= close_threshold for r in sim_results):
        category, color = "close-match", "orange"; close_count += 1
    else:
        category, color = "no-match", "red"; failed += 1
        with open(fail_log_path, "a", encoding="utf-8") as f:
            f.write(f"{wb}\t{expected}\n")

    # Build links with HTML‐table tooltips
    sim_links = []
    table_html_parts = []
        
    for result in sim_results:
        tag = result["tag2"]  # the Morpheus tag (tag1 is the expected N1904-TF tag)
        details = result["details"]
    
        total_raw = 0.0
        total_weight = 0
        # Header row
        rows = [
            "<tr>"
            "<th>Feature</th><th>N1904-TF</th><th>Morpheus</th>"
            "<th>Sim</th><th>Wt</th><th>Contrib</th>"
            "</tr>"
        ]
    
        # Build one row per feature
        for feat, dv in details.items():
            kv = html.escape(str(dv["tag1"]))  # N1904-TF
            gv = html.escape(str(dv["tag2"]))  # Morpheus 
            s  = dv["similarity"]
            w  = dv["weight"]
            contrib = s * w
    
            total_raw    += contrib
            total_weight += w
    
            rows.append(
                "<tr>"
                f"<td>{feat}</td>"
                f"<td>{kv}</td>"
                f"<td>{gv}</td>"
                f"<td>{s:.2f}</td>"
                f"<td>{w}</td>"
                f"<td>{contrib:.2f}</td>"
                "</tr>"
            )
    
        # Totals row
        rows.append(
            "<tr style='font-weight:bold'>"
            "<td>Total raw</td><td></td><td></td>"
            f"<td></td><td>{total_weight}</td><td>{total_raw:.2f}</td>"
            "</tr>"
        )
        # Normalized similarity row
        normalized = (total_raw / total_weight) if total_weight else 0.0
        rows.append(
            "<tr style='font-weight:bold'>"
            "<td>Normalized</td><td></td><td></td>"
            f"<td>{normalized:.3f}</td><td></td><td></td>"
            "</tr>"
        )
        
        # Wrap up this tag’s table 
        table_html = (
            f"<h3>Analysis for tag: {html.escape(tag)}</h3>"
            "<table>"
            + "".join(rows) +
            "</table>"
        )

        # Determine similarity color
        if normalized == 1: 
            item_color='green'
        elif normalized >= close_threshold:
            item_color='orange'
        else:
            item_color='red'

        # Tooltip wrapper
        sim_links.append(
            f'<span class="tooltip">'
            f'<a href="https://tonyjurg.github.io/'
            f'Sandborg-Petersen-decoder/index.html?tag={tag}" '
            f'target="decoder" '
            f'style="color:inherit;text-decoration:none;">'
            f'{tag}'
            f'</a>: '
            f'<span style="color:{item_color};">{normalized:.2f}</span>'
            f'<div class="tooltip-content">{table_html}</div>'
            f'</span>'
        )

    sim_str = ", ".join(sim_links)

    # Append this word’s block
    HTMLobject += (
        f'\n<!-- entry-->\n<div id="entry-{wb}" class="entry {category}">'
        f'<div style="color:{color};font-weight:bold;">'
        f'<a href="https://www.perseus.tufts.edu/hopper/morph?l={wb}&la=greek" '
        f'target="perseus" style="color:black;text-decoration:none;">{wb}</a> '
        f'({uni}) ⇒ {category}</div>\n'
        f'<div style="font-weight:bold;">'
        f'N1904-TF: '
        f'<a href="https://tonyjurg.github.io/Sandborg-Petersen-decoder/index.html?tag={expected}" '
        f'target="decoder" style="color:black;text-decoration:none;">{expected}</a>'
        f'</div>\n'
        f'<div style="font-weight:bold;">Morpheus: {sim_str}</div><br>\n'
        f'<details><summary>Morpheus blocks</summary>{format_string(entry["blocks"])}</details>\n'
        f'<details><summary>Parsed JSON</summary>'
        f'<pre style="background:#f7f7f7;padding:10px;border:1px solid #ddd;'
        f'overflow:auto;white-space:pre-wrap;">'
        f'{html.escape(json.dumps(entry["analyses"], indent=4, ensure_ascii=False))}'
        f'</pre></details>\n<hr/></div>\n'
    )


HTMLtotal += (f'<h1>Morph tag matching between N1904-TF and Morpheus</h1>'
            f'<h2>Test results summary</h2>'
            f'<p>Inputfile: {inputfile}</p>'
            f'<p>counts:</p><ul>'
            f'<li>Total: {total}</li>'
            f'<li><span style="color:green;" >Match: {passed}</span> <span onclick="toggleFilter(\'match\')">(hide/show)</span></li>'
            f'<li><span style="color:orange;" >Close match: {close_count}</span> (threshold={close_threshold}) <span onclick="toggleFilter(\'close-match\')">(hide/show)</span></li>'
            f'<li><span style="color:red;" >No match: {failed}</span> <span onclick="toggleFilter(\'no-match\')">(hide/show)</span></li>'
            f'<li><span style="color:gray;" >No Morpheus block received: {no_block_count}</span> <span onclick="toggleFilter(\'no-block\')">(hide/show)</span></li></ul>\n'
            f'<h2>Detailed test results</h2>'
)

HTMLtotal += HTMLobject

# Close out HTML
HTMLtotal += "</body></html>"

# Write & display as before…
with open("SP-tag-analysis.htm", "w", encoding="utf-8") as f:
    f.write(HTMLtotal)


display(HTML(f'<h2>Test Summary</h2><p>Total cases: {total}<br>Matches: {passed}<br>Close match: {close_count}<br>No-matches: {failed}</p>'))
display(HTML('<p>Wrote results to SP-tag-analysis.htm</p>'))

Processing words: 100%|██████████| 1055/1055 [00:22<00:00, 45.94it/s]
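
The category decision made inside the loop above can be restated as a small pure function, which makes the precedence of the four outcomes easier to see. The name `classify` and its signature are assumptions for this sketch; the logic mirrors the `if`/`elif` chain in the cell.

```python
def classify(expected, similarities, has_blocks, close_threshold=0.8):
    """Return the report category for one word.

    similarities maps each Morpheus tag to its overall similarity with `expected`.
    Hypothetical helper; it only restates the if/elif chain of the test cell.
    """
    if not has_blocks:
        return "no-block"        # Morpheus returned nothing to compare
    if expected in similarities:
        return "match"           # exact tag found among the Morpheus tags
    if any(score >= close_threshold for score in similarities.values()):
        return "close-match"     # at least one tag is similar enough
    return "no-match"

# 0.888 is the similarity of N-NSN-ATT and N-ASM-ATT computed in section 2.1
print(classify("N-NSN-ATT", {"N-ASM-ATT": 0.888}, has_blocks=True))
```

Because the `no-block` test comes first, a word without Morpheus blocks is never counted as a failure, even though its tag list is empty.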


# THE MULTI LEMMA VERSION

In [19]:
"""
Comparing Morpheus and SP-tags

Reads a tab-separated file with Betacode words and expected SP-tags, then
calls Morpheus via `analyze_word_with_morpheus` to fetch and parse each form.
Next it flattens multi-tags into individual codes and checks for matches.
The script generates a color-coded HTML report showing, for each entry:
  • the input and expected tag with match/no-match indication
  • a collapsible raw block from Morpheus and a collapsible JSON dump
The result is written to the file `SP-tag-analysis.htm` (to be opened in a browser),
and a brief summary is displayed in the notebook.
"""

debug=False

HTMLtotal=''
HTMLobject = ''  # reset, so entries from an earlier run are not carried over
import csv
import re
import requests
import json
import html
import beta_code
from IPython.display import display, HTML
import sys  # needed for sys.stderr in the error handler below
import traceback
import urllib.parse
from collections import defaultdict
from tqdm.std import tqdm

baseUrl = "http://10.0.1.156:1315/greek/"

# path to your failure log
fail_log_path = "failures.txt"
# ensure it exists (or clear it) before the run
open(fail_log_path, "w", encoding="utf-8").close()

def format_string(s):
    if not s:
        return "<pre>empty</pre>"

    lines = []
    for sublist in s:
        line = ', '.join(sublist)
        lines.append(line)

    formatted_lines = []

    for line in lines:
        items = line.split(', ')
        formatted_items = []

        for item in items:
            item = re.sub(r'\t', '    ', item)
            item = re.sub(r':raw', '\n:raw', item)
            item = item.strip("'")  # remove quotes
            formatted_items.append(item)

        formatted_line = '\n'.join(formatted_items)
        formatted_lines.append(formatted_line)

    if not formatted_lines:
        return "<pre></pre>"

    formatted_string = '\n'.join(formatted_lines)
    html_string = f"<pre>\n{formatted_string}\n</pre>"
    return html_string


def flatten_tags(analyses):
    """
    Given a list of parse‐dicts, return a sorted list of
    individual SP‐tags (splitting any "V-PAP-DSM/V-PAP-DSN" etc.).
    """
    flat = set()
    for p in analyses:
        tag = p.get("sp_morph_tag") or ""
        # skip if there's no tag
        if not tag.strip():
            continue
        # split on "/" and add each non-empty part
        for part in tag.split("/"):
            part = part.strip()
            if part:
                flat.add(part)
    return sorted(flat)
    

tests = []
# --- load testset --------------------------------------
#inputfile = 'problems.txt' # the words where no or no close match was found during last run
inputfile = 'test-set.txt' # sample testset (small with a few different types of tags)
#inputfile = 'beta_morph_pairs.txt' # large set (one word for each morph tags found in the GNT)
with open(inputfile, 'r', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if not row or row[0].startswith('#'):
            continue
        tests.append({
            'word_betacode':  row[0].strip(),
            'expected_sp_tag': row[1].strip()
        })

total, passed, close_count, no_block_count, failed = len(tests), 0, 0, 0, 0
close_threshold = 0.8  # adjustable threshold for “close match”

# --- run tests ----------------------------------------

# Start building the HTML, including our tooltip‐CSS in the <head>
HTMLstart = '''<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>SP-Tag Test Results</title>
  <style>
    body { font-family: sans-serif; padding: 20px; }

    /* Tooltip container */
    .tooltip {
      position: relative;
      display: inline-block;
      cursor: help;
    }
    /* Hidden content */
    .tooltip .tooltip-content {
      visibility: hidden;
      background: white;
      color: black;
      text-align: left;
      border: 1px solid #ccc;
      padding: 6px;
      position: absolute;
      z-index: 100;
      top: 100%;
      left: 50%;
      transform: translateX(-20%);
      white-space: nowrap;
    }
    /* Show on hover */
    .tooltip:hover .tooltip-content {
      visibility: visible;
    }
    /* Table styling */
    .tooltip-content table {
      border-collapse: collapse;
      font-family: monospace;
      font-size: 12px;
    }
    .tooltip-content th,
    .tooltip-content td {
      border: 1px solid #ddd;
      padding: 2px 6px;
    }
    .tooltip-content th {
      background: #f0f0f0;
    }

    .no-block { opacity: 0.6; }
  </style>
  
  <script>
// toggle all elements of class `cat`
function toggleFilter(cat) {
  const elems = document.getElementsByClassName(cat);
  for (let e of elems) {
    e.style.display = (e.style.display === 'none') ? 'block' : 'none';
  }
}
</script>
</head><body>
'''


for item in tqdm(tests, desc="Processing words"):
    wb       = item['word_betacode']
    expected = item['expected_sp_tag']
    uni      = beta_code.beta_code_to_greek(wb)

    try:
        entry = morphkit.analyze_word_with_morpheus(wb, baseUrl, debug=False)
    except Exception as e:
        print(f'Failing word: {wb} with error: {e}')
        failed += 1
        if debug:
            # Print the full stack trace to stderr
            print(traceback.format_exc(), file=sys.stderr)
        continue

    # — check for no Morpheus blocks —
    has_blocks = bool(entry.get("blocks"))
    no_block   = not has_blocks

    # Group analyses by their lemma_beta
    lemma_groups = defaultdict(list)
    for parse in entry.get("analyses", []):
        # get the raw betacode (might be None), then drop all '*' markers
        lemma = (parse.get("lemma_beta", "") or "").replace("*", "")
        lemma_groups[lemma].append(parse)

    # Start the outer entry div for this word
    HTMLobject += (
        f'\n<!-- entry for {wb} -->\n'
        f'<div id="entry-{wb}" class="entry">\n'
        f'<h2><a href="https://www.perseus.tufts.edu/hopper/morph?l={wb}&la=greek" '
        f'target="perseus" style="color:black;text-decoration:none;">{wb} ({uni})</a></h2>\n'
    )

    # If no blocks at all, show a single UNK‐style line and close
    if no_block:
        no_block_count += 1
        HTMLobject += (
            f'  <div style="color:gray;font-weight:bold;">no-block</div>\n'
            f'</div>\n<hr/>\n'
        )
        continue

    # Otherwise, iterate each lemma subgroup
    for lemma_beta, parses in lemma_groups.items():
        lemma_uni = beta_code.beta_code_to_greek(lemma_beta)
        # Flatten tags for this lemma only
        actual_list = flatten_tags(parses)
        ok          = expected in actual_list

        # Build similarity results
        sim_results = [
            morphkit.compare_tags(expected, tag, debug=False)
            for tag in actual_list
        ]
        sim_results.sort(key=lambda r: r['overall_similarity'], reverse=True)

        # Pick a category/color per lemma
        if ok:
            category, color = "match", "green"
            passed += 1
        elif any(r['overall_similarity'] >= close_threshold for r in sim_results):
            category, color = "close-match", "orange"
            close_count += 1
        else:
            category, color = "no-match", "red"
            failed += 1
            with open(fail_log_path, "a", encoding="utf-8") as f:
                f.write(f"{wb}\t{lemma_beta}\t{expected}\n")

        # Lemma header
        HTMLobject += (
            f'  <div class="lemma-group {category}" style="margin-left:1em;">\n'
            f'    <h4>Lemma: {lemma_beta} ({lemma_uni}) '
            f'<span style="color:{color};font-weight:bold;">[{category}]</span></h4>\n'
        )

        # Build tooltip links exactly as before
        sim_links = []
        for result in sim_results:
            tag     = result["tag2"]  # the Morpheus tag (tag1 is the expected N1904-TF tag)
            details = result["details"]

            total_raw = 0.0
            total_wt  = 0
            rows = ["<tr><th>Feature</th><th>TF</th><th>Mph</th><th>Sim</th><th>Wt</th><th>Contrib</th></tr>"]
            for feat, dv in details.items():
                kv      = html.escape(str(dv["tag1"])) # N1904-TF
                gv      = html.escape(str(dv["tag2"])) # Morpheus
                s       = dv["similarity"]
                w       = dv["weight"]
                contrib = s * w
                total_raw += contrib
                total_wt  += w
                rows.append(
                    "<tr>"
                    f"<td>{feat}</td><td>{kv}</td><td>{gv}</td>"
                    f"<td>{s:.2f}</td><td>{w}</td><td>{contrib:.2f}</td>"
                    "</tr>"
                )
            # totals
            rows.append(
                "<tr style='font-weight:bold'>"
                f"<td>Total</td><td></td><td></td><td></td><td>{total_wt}</td><td>{total_raw:.2f}</td>"
                "</tr>"
            )
            norm = (total_raw / total_wt) if total_wt else 0.0
            rows.append(
                "<tr style='font-weight:bold'>"
                f"<td>Norm</td><td></td><td></td><td>{norm:.3f}</td><td></td><td></td>"
                "</tr>"
            )

            table_html = (
                f"<h4>Tag: {html.escape(tag)}</h4>"
                "<table>" + "".join(rows) + "</table>"
            )

            # color by similarity
            item_color = (
                "green" if norm == 1
                else "orange" if norm >= close_threshold
                else "red"
            )

            sim_links.append(
                f'<span class="tooltip">'
                f'<a href="https://tonyjurg.github.io/Sandborg-Petersen-decoder/index.html?tag={tag}" '
                f'target="decoder" style="text-decoration:none;">{tag}</a>: '
                f'<span style="color:{item_color};">{norm:.2f}</span>'
                f'<div class="tooltip-content">{table_html}</div>'
                f'</span>'
            )

        # Render the TF vs Morpheus line with all tooltips
        HTMLobject += (
            f'<div>'
            f'<strong>N1904-TF:</strong> '
            f'<a href="https://tonyjurg.github.io/Sandborg-Petersen-decoder/index.html?tag={expected}" '
            f'target="decoder" style="text-decoration:none;">{expected}</a></div>\n'
            f'<div><strong>Morpheus:</strong>' 
            + ", ".join(sim_links) +
            "</div>\n"
        )

        # Blocks and JSON for this lemma
        HTMLobject += (
            f'    <details><summary>Blocks</summary>'
            f'{format_string(entry["blocks"])}</details>\n'
            f'    <details><summary>JSON</summary>'
            f'<pre>{html.escape(json.dumps(parses, indent=2, ensure_ascii=False))}</pre>'
            f'</details>\n'
            f'  </div>\n'
        )

    # Close the word entry
    HTMLobject += '</div>\n<hr/>\n'


# After loop, wrap up HTMLtotal 
HTMLtotal =  (HTMLstart +
    f'<h1>Morph tag matching between N1904-TF and Morpheus</h1>'
    f'<h2>Test results summary</h2>'
    f'<p>Inputfile: {inputfile}</p>'
    f'<ul>'
    f'<li>Total: {total}</li>'
    f'<li><span style="color:green;">Match: {passed}</span> '
    f'<span onclick="toggleFilter(\'match\')">(hide/show)</span></li>'
    f'<li><span style="color:orange;">Close match: {close_count}</span> '
    f'(threshold={close_threshold}) '
    f'<span onclick="toggleFilter(\'close-match\')">(hide/show)</span></li>'
    f'<li><span style="color:red;">No match: {failed}</span> '
    f'<span onclick="toggleFilter(\'no-match\')">(hide/show)</span></li>'
    f'<li><span style="color:gray;">No Morpheus block received: {no_block_count}</span> '
    f'<span onclick="toggleFilter(\'no-block\')">(hide/show)</span></li>'
    f'</ul>\n'
    f'<h2>Detailed test results</h2>\n'
    + HTMLobject
)


# Close out HTML
HTMLtotal += "</body></html>"

# Write & display as before…
with open("SP-tag-analysis_M.htm", "w", encoding="utf-8") as f:
    f.write(HTMLtotal)


display(HTML(f'<h2>Test Summary</h2><p>Total cases: {total}<br>Matches: {passed}<br>Close match: {close_count}<br>No-matches: {failed}</p>'))
display(HTML('<p>Wrote results to SP-tag-analysis_M.htm</p>'))

Processing words: 100%|██████████| 1055/1055 [00:27<00:00, 38.03it/s]
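
The tooltip tables built above reduce each tag comparison to one normalized score: every feature contributes `similarity * weight`, and the sum is divided by the total weight. The sketch below isolates that arithmetic together with the color rule. The feature names, the weights and the `example_details` dictionary are invented for illustration, and the default threshold of 0.8 is an assumption; in the notebook these values come from the `details` of each comparison result and from `close_threshold`.

```python
def weighted_similarity(details):
    """Return (total_weight, weighted_sum, normalized_score) for a details dict."""
    total_wt = 0
    total_raw = 0.0
    for dv in details.values():
        total_raw += dv["similarity"] * dv["weight"]
        total_wt += dv["weight"]
    # Guard against an empty comparison (no features, hence zero weight)
    norm = total_raw / total_wt if total_wt else 0.0
    return total_wt, total_raw, norm


def color_for(norm, close_threshold=0.8):
    """Map a normalized score to the color used in the report."""
    if norm == 1:
        return "green"
    if norm >= close_threshold:
        return "orange"
    return "red"


# Hypothetical comparison: two features agree, one (lightly weighted) differs
example_details = {
    "pos":   {"tag1": "V", "tag2": "V", "similarity": 1.0, "weight": 3},
    "tense": {"tag1": "P", "tag2": "P", "similarity": 1.0, "weight": 2},
    "case":  {"tag1": "D", "tag2": "G", "similarity": 0.0, "weight": 1},
}

total_wt, total_raw, norm = weighted_similarity(example_details)
print(total_wt, total_raw, round(norm, 3), color_for(norm))  # 6 5.0 0.833 orange
```

Because the score is a weighted mean, a mismatch on a low-weight feature still leaves the tag in the "close" band, while a mismatch on a heavily weighted feature pushes it below the threshold.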


In [None]:
# Also display in the notebook (this works, but it produces a very large amount of output)
display(HTML(HTMLtotal))

In [22]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import csv
import json
import html as html_mod
from collections import defaultdict
from functools import lru_cache
from tqdm.std import tqdm

import beta_code
import morphkit 

# ————— Configuration —————
#inputfile = 'problems.txt' # the words where no or no close match was found during last run
#inputfile = 'testset.txt' # sample testset (small with a few different types of tags)
#inputfile = 'beta_morph_pairs.txt' # large set (one word for each morph tags found in the GNT)
INPUT_FILE   = "beta_morph_pairs.txt"          # TSV: word_betacode \t expected_sp_tag
OUTPUT_FILE  = "SP-tag-analysis_3.html"
BASE_URL     = "http://10.0.1.156:1315/greek/"
CLOSE_THRESH = 0.8
# ———————————————————————

# 1) Caches
@lru_cache(maxsize=None)
def fetch_analysis(word_betacode):
    return morphkit.analyze_word_with_morpheus(word_betacode, BASE_URL, debug=False)

@lru_cache(maxsize=None)
def to_greek(word_betacode):
    return beta_code.beta_code_to_greek(word_betacode)

# 2) Helpers
def flatten_tags(analyses):
    flat = set()
    for p in analyses:
        for part in (p.get("sp_morph_tag") or "").split("/"):
            part = part.strip()
            if part:
                flat.add(part)
    return sorted(flat)

ROW_TMPL = (
    "<tr>"
    "<td>{feat}</td>"
    "<td>{known}</td>"
    "<td>{gen}</td>"
    "<td>{sim:.2f}</td>"
    "<td>{wt}</td>"
    "<td>{contrib:.2f}</td>"
    "</tr>"
)

def make_table(details):
    parts = [
        "<tr>"
        "<th>Feature</th><th>TF</th><th>Mph</th>"
        "<th>Sim</th><th>Wt</th><th>Contrib</th>"
        "</tr>"
    ]
    total_raw = 0.0
    total_wt  = 0
    for feat, dv in details.items():
        known   = html_mod.escape(str(dv["tag1"] or ""))  #N1904-TF
        gen     = html_mod.escape(str(dv["tag2"] or "")) #Morpheus
        sim     = dv["similarity"]
        wt      = dv["weight"]
        contrib = sim * wt
        total_raw += contrib
        total_wt  += wt
        parts.append(ROW_TMPL.format(
            feat=feat, known=known, gen=gen, sim=sim, wt=wt, contrib=contrib
        ))
    parts.append(
        "<tr style='font-weight:bold'>"
        f"<td>Total</td><td></td><td></td><td></td>"
        f"<td>{total_wt}</td><td>{total_raw:.2f}</td>"
        "</tr>"
    )
    norm = (total_raw/total_wt) if total_wt else 0.0
    parts.append(
        "<tr style='font-weight:bold'>"
        f"<td>Norm</td><td></td><td></td>"
        f"<td>{norm:.3f}</td><td></td><td></td>"
        "</tr>"
    )
    return "<table>" + "".join(parts) + "</table>"

def format_blocks(blocks):
    if not blocks:
        return "<pre>empty</pre>"
    lines = []
    for sub in blocks:
        line = ", ".join(sub)
        line = line.replace("\t", "    ").replace(":raw", "\n:raw")
        lines.append(line.strip("'"))
    return "<pre>\n" + "\n".join(lines) + "\n</pre>"

# 3) Load tests
tests = []
with open(INPUT_FILE, newline="", encoding="utf-8") as f:
    rdr = csv.reader(f, delimiter="\t")
    for row in rdr:
        if not row or row[0].startswith("#"):
            continue
        tests.append({
            "word_betacode":  row[0].strip(),
            "expected_sp_tag": row[1].strip()
        })

    # 4) Process and build entries
    summary = {"total": len(tests), "passed":0, "close":0, "failed":0, "no_block":0}
    entries = []

    for item in tqdm(tests, desc="Processing words"):
        wb       = item["word_betacode"]
        expected = item["expected_sp_tag"]
        uni      = to_greek(wb)
    
        # fetch analyses
        entry = fetch_analysis(wb)
        blocks   = entry.get("blocks", [])
        analyses = entry.get("analyses", [])
        if not blocks:
            summary["no_block"] += 1
    
        # group by lemma_beta (strip '*')
        lemma_groups = defaultdict(list)
        for p in analyses:
            key = (p.get("lemma_beta") or "").replace("*","")
            lemma_groups[key].append(p)
    
        # start this word’s HTML
        html = []
        html.append(f'<div id="entry-{wb}">')
        html.append(f'<h2><a href="https://www.perseus.tufts.edu/hopper/morph?l={wb}&la=greek" target="_blank">{wb} ({uni})</a></h2>')
    
        if not blocks:
            html.append('<div class="no-block">no Morpheus blocks</div>')
            html.append('</div><hr/>')
            entries.append("\n".join(html))
            continue
    
        # per-lemma subentries
        for lemma, parses in lemma_groups.items():
            lemma_uni = to_greek(lemma)
            tags      = flatten_tags(parses)
    
            # find best tag & score, stopping on perfect match
            best_tag, best_score = None, -1.0
            for t in tags:
                sc = morphkit.compare_tags(expected, t, debug=False)["overall_similarity"]
                if sc > best_score:
                    best_tag, best_score = t, sc
                    if sc == 1.0:
                        break
    
            # category
            if best_score == 1.0:
                cat, col = "match", "green"
                summary["passed"] += 1
            elif best_score >= CLOSE_THRESH:
                cat, col = "close", "orange"
                summary["close"] += 1
            else:
                cat, col = "no-match", "red"
                summary["failed"] += 1
    
            html.append(f'<div class="lemma-group {cat}">')
            html.append(f'  <h3>Lemma: {lemma} ({lemma_uni}) <span style="color:{col};">[{cat}]</span></h3>')
            html.append('  <div><strong>Expected:</strong> '
                        f'<a href="https://tonyjurg.github.io/Sandborg-Petersen-decoder/index.html?tag={expected}" target="_blank">{expected}</a></div>')
    
            # tooltip for best_tag
            cmp_res = morphkit.compare_tags(expected, best_tag, debug=False)
            table   = make_table(cmp_res["details"])
            html.append('  <div><strong>Morpheus:</strong> '
                        f'<span class="tooltip">{best_tag} : {best_score:.2f}'
                        f'<div class="tooltip-content"><h4>Tag: {best_tag}</h4>{table}</div>'
                        '</span></div>')
    
            # show blocks & JSON only on non-perfect
            if cat != "match":
                html.append('  <details><summary>Blocks</summary>' + format_blocks(blocks) + '</details>')
                html.append('  <details><summary>Analyses JSON</summary><pre>' +
                            html_mod.escape(json.dumps(parses, indent=2, ensure_ascii=False)) +
                            '</pre></details>')
    
            html.append('</div>')  # close lemma-group
    
        html.append('</div><hr/>')  # close entry
        entries.append("\n".join(html))

    # 5) Write out final HTML
    with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
        # write the header
        out.write(f"""<!DOCTYPE html>
        <html lang="en">
        <head><meta charset="utf-8">
        <title>SP-Tag Analysis</title>
        <style>
          body {{ font-family:sans-serif; padding:20px }}
          .tooltip {{ position:relative; display:inline-block; cursor:help }}
          .tooltip .tooltip-content {{ display:none; position:absolute; top:1.2em; left:0;
            background:white; border:1px solid #ccc; padding:6px; z-index:100;
            white-space:nowrap }}
          .tooltip:hover .tooltip-content {{ display:block }}
          .tooltip-content table {{ border-collapse:collapse; font-family:monospace; font-size:12px }}
          .tooltip-content th, .tooltip-content td {{ border:1px solid #ddd; padding:2px 6px }}
          .tooltip-content th {{ background:#f0f0f0 }}
          .no-block {{ opacity:0.6; color:gray }}
        </style>
        </head>
        <body>
        <h1>SP-Tag Analysis Report</h1>
        <ul>
          <li>Total words: {summary['total']}</li>
          <li>Passed: <span style="color:green;">{summary['passed']}</span></li>
          <li>Close: <span style="color:orange;">{summary['close']}</span></li>
          <li>Failed: <span style="color:red;">{summary['failed']}</span></li>
          <li>No blocks: <span style="color:gray;">{summary['no_block']}</span></li>
        </ul>
        <hr/>
        """)
        
        for e in entries:
            out.write(e + "\n")
        
        out.write("</body></html>")

print(f"✅ Report written to {OUTPUT_FILE}")


Processing words: 100%|██████████| 1055/1055 [00:18<00:00, 57.43it/s]

✅ Report written to SP-tag-analysis_3.html
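
The cell above picks, per lemma, the Morpheus tag that scores highest against the expected tag and then sorts the result into one of three categories. The sketch below shows that selection logic in isolation, with one addition that is not in the cell: an explicit guard for the case where a lemma group yields no tags at all, so that `best_tag` stays `None`. The function `toy_score` is a hypothetical stand-in for `morphkit.compare_tags(...)["overall_similarity"]` and does not reflect how morphkit actually weighs features.

```python
def toy_score(expected, candidate):
    """Hypothetical scorer: fraction of positions where both tags agree."""
    if expected == candidate:
        return 1.0
    longest = max(len(expected), len(candidate))
    same = sum(1 for a, b in zip(expected, candidate) if a == b)
    return same / longest


def pick_best(expected, tags, score_fn):
    """Return (best_tag, best_score), stopping early on a perfect match."""
    best_tag, best_score = None, -1.0
    for tag in tags:
        score = score_fn(expected, tag)
        if score > best_score:
            best_tag, best_score = tag, score
            if score == 1.0:
                break
    return best_tag, best_score


def categorize(best_tag, best_score, close_thresh=0.8):
    """Map the best score to 'match', 'close' or 'no-match'."""
    if best_tag is None:  # no candidate tags were available
        return "no-match"
    if best_score == 1.0:
        return "match"
    if best_score >= close_thresh:
        return "close"
    return "no-match"


best = pick_best("N-NSM", ["N-GSM", "N-NSM", "A-NSM"], toy_score)
print(best, categorize(*best))
```

With such a guard in place, the caller can skip the second `compare_tags` call whenever `best_tag` is `None` instead of passing `None` into the comparison.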





# Now with the lemma included

In [14]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import csv
import json
import html as html_mod
from collections import defaultdict
from functools import lru_cache
from tqdm.std import tqdm

import beta_code
import morphkit 

# ————— Configuration —————
#inputfile = 'problems.txt' # the words where no or no close match was found during last run
#inputfile = 'testset.txt' # sample testset (small with a few different types of tags)
#inputfile = 'beta_morph_pairs.txt' # large set (one word for each morph tags found in the GNT)
INPUT_FILE   = "test-set.txt"          # TSV: expected_sp_tag \t word_betacode (column order as read in step 3)
OUTPUT_FILE  = "SP-tag-analysis_3.html"
BASE_URL     = "http://10.0.1.156:1315/greek/"
CLOSE_THRESH = 0.8
# ———————————————————————

# 1) Caches
@lru_cache(maxsize=None)
def fetch_analysis(word_betacode):
    return morphkit.analyze_word_with_morpheus(word_betacode, BASE_URL, debug=False)

@lru_cache(maxsize=None)
def to_greek(word_betacode):
    return beta_code.beta_code_to_greek(word_betacode)

# 2) Helpers
def flatten_tags(analyses):
    flat = set()
    for p in analyses:
        for part in (p.get("sp_morph_tag") or "").split("/"):
            part = part.strip()
            if part:
                flat.add(part)
    return sorted(flat)

ROW_TMPL = (
    "<tr>"
    "<td>{feat}</td>"
    "<td>{known}</td>"
    "<td>{gen}</td>"
    "<td>{sim:.2f}</td>"
    "<td>{wt}</td>"
    "<td>{contrib:.2f}</td>"
    "</tr>"
)

def make_table(details):
    parts = [
        "<tr>"
        "<th>Feature</th><th>TF</th><th>Mph</th>"
        "<th>Sim</th><th>Wt</th><th>Contrib</th>"
        "</tr>"
    ]
    total_raw = 0.0
    total_wt  = 0
    for feat, dv in details.items():
        known   = html_mod.escape(str(dv["tag1"] or ""))  #N1904-TF
        gen     = html_mod.escape(str(dv["tag2"] or "")) #Morpheus
        sim     = dv["similarity"]
        wt      = dv["weight"]
        contrib = sim * wt
        total_raw += contrib
        total_wt  += wt
        parts.append(ROW_TMPL.format(
            feat=feat, known=known, gen=gen, sim=sim, wt=wt, contrib=contrib
        ))
    parts.append(
        "<tr style='font-weight:bold'>"
        f"<td>Total</td><td></td><td></td><td></td>"
        f"<td>{total_wt}</td><td>{total_raw:.2f}</td>"
        "</tr>"
    )
    norm = (total_raw/total_wt) if total_wt else 0.0
    parts.append(
        "<tr style='font-weight:bold'>"
        f"<td>Norm</td><td></td><td></td>"
        f"<td>{norm:.3f}</td><td></td><td></td>"
        "</tr>"
    )
    return "<table>" + "".join(parts) + "</table>"

def format_blocks(blocks):
    if not blocks:
        return "<pre>empty</pre>"
    lines = []
    for sub in blocks:
        line = ", ".join(sub)
        line = line.replace("\t", "    ").replace(":raw", "\n:raw")
        lines.append(line.strip("'"))
    return "<pre>\n" + "\n".join(lines) + "\n</pre>"

# 3) Load tests
tests = []
with open(INPUT_FILE, newline="", encoding="utf-8") as f:
    rdr = csv.reader(f, delimiter="\t")
    for row in rdr:
        if not row or row[0].startswith("#"):
            continue
        tests.append({
            "expected_sp_tag": row[0].strip(),
            "word_betacode":  row[1].strip(),
            # assumption: the lemma sits in a third column; empty when absent
            "lemma_betacode":  row[2].strip() if len(row) > 2 else ""
        })

    # 4) Process and build entries
    summary = {"total": len(tests), "passed":0, "close":0, "failed":0, "no_block":0}
    entries = []

    for item in tqdm(tests, desc="Processing words"):
        wb       = item["word_betacode"]
        expected = item["expected_sp_tag"]
        uni      = to_greek(wb)
    
        # fetch analyses
        entry = fetch_analysis(wb)
        blocks   = entry.get("blocks", [])
        analyses = entry.get("analyses", [])
        if not blocks:
            summary["no_block"] += 1
    
        # group by lemma_beta (strip '*')
        lemma_groups = defaultdict(list)
        for p in analyses:
            key = (p.get("lemma_beta") or "").replace("*","")
            lemma_groups[key].append(p)
    
        # start this word’s HTML
        html = []
        html.append(f'<div id="entry-{wb}">')
        html.append(f'<h2><a href="https://www.perseus.tufts.edu/hopper/morph?l={wb}&la=greek" target="_blank">{wb} ({uni})</a></h2>')
    
        if not blocks:
            html.append('<div class="no-block">no Morpheus blocks</div>')
            html.append('</div><hr/>')
            entries.append("\n".join(html))
            continue
    
        # per-lemma subentries
        for lemma, parses in lemma_groups.items():
            lemma_uni = to_greek(lemma)
            tags      = flatten_tags(parses)
    
            # find best tag & score, stopping on perfect match
            best_tag, best_score = None, -1.0
            for t in tags:
                sc = morphkit.compare_tags(expected, t, debug=False)["overall_similarity"]
                if sc > best_score:
                    best_tag, best_score = t, sc
                    if sc == 1.0:
                        break
    
            # category
            if best_score == 1.0:
                cat, col = "match", "green"
                summary["passed"] += 1
            elif best_score >= CLOSE_THRESH:
                cat, col = "close", "orange"
                summary["close"] += 1
            else:
                cat, col = "no-match", "red"
                summary["failed"] += 1
    
            html.append(f'<div class="lemma-group {cat}">')
            if item["lemma_betacode"]==lemma:
                lemma_match='green'
            else:
                lemma_match='grey'
            html.append(f'  <h3><span style="color:{lemma_match};">Lemma: {lemma} ({lemma_uni})</span> <span style="color:{col};">[{cat}]</span></h3>')
            html.append('  <div><strong>Expected:</strong> '
                        f'<a href="https://tonyjurg.github.io/Sandborg-Petersen-decoder/index.html?tag={expected}" target="_blank">{expected}</a></div>')
    
            # tooltip for best_tag
            cmp_res = morphkit.compare_tags(expected, best_tag, debug=False)
            table   = make_table(cmp_res["details"])
            html.append('  <div><strong>Morpheus:</strong> '
                        f'<span class="tooltip">{best_tag} : {best_score:.2f}'
                        f'<div class="tooltip-content"><h4>Tag: {best_tag}</h4>{table}</div>'
                        '</span></div>')
    
            # show blocks & JSON only on non-perfect
            if cat != "match":
                html.append('  <details><summary>Blocks</summary>' + format_blocks(blocks) + '</details>')
                html.append('  <details><summary>Analyses JSON</summary><pre>' +
                            html_mod.escape(json.dumps(parses, indent=2, ensure_ascii=False)) +
                            '</pre></details>')
    
            html.append('</div>')  # close lemma-group
    
        html.append('</div><hr/>')  # close entry
        entries.append("\n".join(html))

    # 5) Write out final HTML
    with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
        # write the header
        out.write(f"""<!DOCTYPE html>
        <html lang="en">
        <head><meta charset="utf-8">
        <title>SP-Tag Analysis</title>
        <style>
          body {{ font-family:sans-serif; padding:20px }}
          .tooltip {{ position:relative; display:inline-block; cursor:help }}
          .tooltip .tooltip-content {{ display:none; position:absolute; top:1.2em; left:0;
            background:white; border:1px solid #ccc; padding:6px; z-index:100;
            white-space:nowrap }}
          .tooltip:hover .tooltip-content {{ display:block }}
          .tooltip-content table {{ border-collapse:collapse; font-family:monospace; font-size:12px }}
          .tooltip-content th, .tooltip-content td {{ border:1px solid #ddd; padding:2px 6px }}
          .tooltip-content th {{ background:#f0f0f0 }}
          .no-block {{ opacity:0.6; color:gray }}
        </style>
        </head>
        <body>
        <h1>SP-Tag Analysis Report</h1>
        <ul>
          <li>Total words: {summary['total']}</li>
          <li>Passed: <span style="color:green;">{summary['passed']}</span></li>
          <li>Close: <span style="color:orange;">{summary['close']}</span></li>
          <li>Failed: <span style="color:red;">{summary['failed']}</span></li>
          <li>No blocks: <span style="color:gray;">{summary['no_block']}</span></li>
        </ul>
        <hr/>
        """)
        
        for e in entries:
            out.write(e + "\n")
        
        out.write("</body></html>")

print(f"✅ Report written to {OUTPUT_FILE}")


Processing words: 100%|██████████| 1055/1055 [00:19<00:00, 53.23it/s]

✅ Report written to SP-tag-analysis_3.html
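
The lemma-aware run above depends on each test row carrying a lemma next to the word and its expected tag. The following sketch shows one way to load such rows defensively and to compare lemmas. The three-column order (tag, word, lemma) is an assumption about the input file, and `load_tests` and `same_lemma` are hypothetical helper names. Unlike the cell above, `same_lemma` strips the Beta Code capitalization marker `*` from both sides before comparing.

```python
import csv
import io


def load_tests(handle):
    """Read rows of: expected_sp_tag, word_betacode, lemma_betacode (assumed order)."""
    tests = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines, and rows lacking the first two columns
        if len(row) < 2 or row[0].startswith("#"):
            continue
        tests.append({
            "expected_sp_tag": row[0].strip(),
            "word_betacode":   row[1].strip(),
            "lemma_betacode":  row[2].strip() if len(row) > 2 else "",
        })
    return tests


def same_lemma(expected_lemma, morpheus_lemma):
    """Compare two Beta Code lemmas, ignoring the '*' capitalization marker."""
    return expected_lemma.replace("*", "") == (morpheus_lemma or "").replace("*", "")


# In-memory stand-in for the TSV file, so the example runs without any input file
sample = io.StringIO(
    "# tag\tword\tlemma\n"
    "N-NSM\tlo/gos\tlo/gos\n"
    "\n"
    "V-PAI-3S\tle/gei\tle/gw\n"
)
tests = load_tests(sample)
print(len(tests), tests[1]["lemma_betacode"])  # 2 le/gw
```

Rows with fewer than three columns get an empty lemma, so the lemma check simply never matches for them rather than raising an `IndexError`.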





# 5 - Footnotes and attribution<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook draws on the following sources for its analysis and implementation:

- [Greek Beta Code standard](https://stephanus.tlg.uci.edu/encoding/BCM.pdf)
- [beta-code-py](https://github.com/perseids-tools/beta-code-py)

# 6 - Required libraries<a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

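The scripts in this notebook import the following Python libraries:

    beta_code
    json
    os
    pathlib
    re
    requests
    unicodedata

Of these, `json`, `os`, `pathlib`, `re` and `unicodedata` are part of the Python standard library and need no installation. The validation cells additionally import `tqdm` and `morphkit`, along with the standard-library modules `csv`, `html`, `collections` and `functools`. Missing third-party libraries such as `beta_code` and `requests` can be installed from within Jupyter Notebook using either `pip` or `pip3`.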
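
To see which of the listed libraries are missing before running the notebook, a check along the following lines can help. It is a minimal sketch using `importlib.util.find_spec` from the standard library; `missing_modules` is a hypothetical helper name, and the list simply repeats the module names given above.

```python
import importlib.util

REQUIRED = ["beta_code", "json", "os", "pathlib", "re", "requests", "unicodedata"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("Missing:", ", ".join(missing) if missing else "none")
```

The check only locates modules and does not import them, so it has no side effects on the running kernel.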

# 7 - Notebook version<a class="anchor" id="bullet7"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.1</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>8 May 2025</td>
    </tr>
  </table>
</div>