## USFM marker placement research

Helpers and processes for efficiently evaluating potential improvements to USFM marker placement.

### Scripts for vetting and prepping test pairs

To treat a pair of USFM files as a source-draft pair for evaluating marker placement, they need to have exactly the same USFM structure, meaning that when parsed, they produce exactly the same set of ScriptureRefs. To be useful, they also need to use paragraph/style markers in the same way, at least in enough verses to make testing worthwhile. Once you have a pair that is likely to be similar (e.g. the "draft" is a book produced by a translation team and you know their source/reference), these scripts speed up the process of making sure they are fully compatible.

The code under "Find vref differences" helps to identify differences in the sets of ScriptureRefs. The two main differences are usually non-verse paragraphs (like remarks and section headers) and verse ranges. Since non-verse paragraphs are not relevant to this task, I usually delete all of them from both files ("Remove embeds in place"). For verse ranges, I will either turn the equivalent verses in the other file into a verse range as well, or I will delete those verses from both files if there is already enough data without them.
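
As a rough illustration of the kind of difference to look for, here is a minimal sketch that compares verse references pulled out of two USFM strings with a regular expression. It is a simplified stand-in, not how ScriptureRefs are actually parsed: `verse_refs` and the sample strings are hypothetical, and it only sees `\c` and `\v` markers, so non-verse paragraphs are not covered.

```python
import re

# Matches "\c 3", "\v 12" and "\v 12-13"; "\ca", "\cl", "\va" etc. do not match
# because the marker must be followed directly by whitespace.
CHAPTER_OR_VERSE = re.compile(r"\\(c|v)\s+(\S+)")


def verse_refs(usfm: str) -> set[str]:
    """Return the set of 'chapter:verse' strings found in a USFM string."""
    refs = set()
    chapter = "0"
    for kind, number in CHAPTER_OR_VERSE.findall(usfm):
        if kind == "c":
            chapter = number
        else:
            refs.add(f"{chapter}:{number}")
    return refs


# Hypothetical sample data: the target merges verses 1 and 2 into a verse range
src = "\\c 1\n\\p\n\\v 1 In the beginning\n\\v 2 was the word\n\\v 3 and more\n"
trg = "\\c 1\n\\p\n\\v 1-2 In the beginning was the word\n\\v 3 and more\n"

only_in_src = sorted(verse_refs(src) - verse_refs(trg))
only_in_trg = sorted(verse_refs(trg) - verse_refs(src))
print("only in source:", only_in_src)
print("only in target:", only_in_trg)
```

A verse range shows up as one reference on one side (`1:1-2`) and several on the other (`1:1`, `1:2`), which is the case that needs to be merged or deleted by hand.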

The code under "Print out all paragraph and character markers for a book" does just that, verse by verse. This is the most helpful step in deciding whether or not to use a pair of files for evaluation. Since it is normally not easy to determine where paragraph/style markers should go in one translation given their placement in another, a pair of files pretty much has to have enough verses with the same set of markers from the get-go. Of course, having the same set of markers does not guarantee that they are being used in the same way, but it is a necessary assumption given the lack of true ground-truth data. I typically look for 2-3 chapters (more if they match up well) that have a fair number of markers of the types that I'm looking for. Then, for each marker discrepancy in those chapters, I either delete the conflicting marker(s) or add/change marker labels if it's obvious what is causing the difference. As long as you only score the placement in the chapters you want to test (more on that below), you don't have to worry about mismatched paragraph or style markers getting placed elsewhere in the book.
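
One way to shortlist chapters is to count, per chapter, how many verses carry identical markers in both files. The sketch below assumes the markers have already been collected into dictionaries keyed by `(chapter, verse)`; the function name `chapter_agreement` and the sample data are hypothetical. It compares the marker lists in order, which is slightly stricter than comparing sets.

```python
from collections import defaultdict


def chapter_agreement(src_markers, trg_markers):
    """Map chapter -> (verses whose markers match, verses with markers in either file)."""
    stats = defaultdict(lambda: [0, 0])
    for ref in sorted(set(src_markers) | set(trg_markers)):
        src = src_markers.get(ref, [])
        trg = trg_markers.get(ref, [])
        if not src and not trg:
            continue  # verses without markers say nothing about marker usage
        chapter = ref[0]
        stats[chapter][1] += 1
        if src == trg:
            stats[chapter][0] += 1
    return {chapter: tuple(counts) for chapter, counts in stats.items()}


# Hypothetical data keyed by (chapter, verse)
src_markers = {(1, 1): ["p"], (1, 2): ["q1", "q2"], (1, 3): [], (2, 1): ["p", "nd", "nd*"]}
trg_markers = {(1, 1): ["p"], (1, 2): ["q1"], (1, 3): [], (2, 1): ["p", "nd", "nd*"]}

for chapter, (same, total) in sorted(chapter_agreement(src_markers, trg_markers).items()):
    print(f"chapter {chapter}: {same}/{total} verses with markers agree")
```

Chapters with many marked verses and a high agreement ratio are the ones worth cleaning up by hand.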

Remove embeds in place

In [None]:
from pathlib import Path
from machine.corpora import UpdateUsfmParserHandler, UpdateUsfmMarkerBehavior, UpdateUsfmTextBehavior, parse_usfm

file_path = Path("")
encoding = "utf-8-sig" # utf-8-sig cp1252
with file_path.open(encoding=encoding) as f:
    usfm = f.read()
handler = UpdateUsfmParserHandler(
    rows=[],
    text_behavior=UpdateUsfmTextBehavior.PREFER_EXISTING,
    paragraph_behavior=UpdateUsfmMarkerBehavior.PRESERVE,
    embed_behavior=UpdateUsfmMarkerBehavior.STRIP,
    style_behavior=UpdateUsfmMarkerBehavior.PRESERVE,
    preserve_paragraph_styles=[],
)
parse_usfm(usfm, handler)
with file_path.open("w", encoding=encoding) as f:
    f.write(handler.get_usfm())

Find vref differences

In [None]:
from pathlib import Path

src_fpath = Path("")
trg_fpath = Path("")
src_out = Path("vrefs_src.txt")
trg_out = Path("vrefs_trg.txt")

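# Paragraph markers to leave out of the comparison. Matching is by prefix,
# so "m" also skips \mt, \ms, etc. and "p" also skips \pi, \pc, etc.
ignore = ["q1", "q2", "p", "b", "li1", "q", "m"]
encoding = "utf-8-sig"  # utf-8-sig cp1252


def write_line_markers(in_path: Path, out_path: Path) -> None:
    """Write the first token of every line, skipping lines that start with an ignored marker."""
    with in_path.open(encoding=encoding) as f, out_path.open("w") as out:
        for line in f:
            marker = line.split(" ")[0].strip()
            # marker[1:] drops the leading backslash before the prefix check
            if not any(marker[1:].startswith(p) for p in ignore):
                out.write(marker + "\n")


write_line_markers(src_fpath, src_out)
write_line_markers(trg_fpath, trg_out)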

Print out all paragraph and character markers for a book

In [None]:
from pathlib import Path
from machine.corpora import FileParatextProjectSettingsParser, UsfmFileText, UsfmTokenizer, UsfmTokenType

book = "HEB"
src_proj_path = Path("")
src_book_path = Path("")
trg_proj_path = Path("")
trg_book_path = Path("")
src_out = Path("markers_src.txt")
trg_out = Path("markers_trg.txt")

all_markers = set()

# file 1
settings = FileParatextProjectSettingsParser(src_proj_path).parse()
file_text = UsfmFileText(
    settings.stylesheet,
    settings.encoding,
    book,
    src_book_path,
    include_markers=True,
    include_all_text=True,
    project=settings.name,
)

to_delete = []
vrefs = []
usfm_markers = []
usfm_tokenizer = UsfmTokenizer(settings.stylesheet)
for sent in file_text:
    if len(sent.ref.path) > 0 and sent.ref.path[-1].name == "rem":
        continue

    vrefs.append(sent.ref)
    usfm_markers.append([])
    usfm_toks = usfm_tokenizer.tokenize(sent.text.strip())
    
    ignore_scope = None
    for j, tok in enumerate(usfm_toks):
        if ignore_scope is not None:
            if tok.type == UsfmTokenType.END and tok.marker[:-1] == ignore_scope.marker:
                ignore_scope = None
        elif tok.type == UsfmTokenType.NOTE or (tok.type == UsfmTokenType.CHARACTER and tok.marker in to_delete):
            ignore_scope = tok
        elif tok.type in [UsfmTokenType.PARAGRAPH, UsfmTokenType.CHARACTER, UsfmTokenType.END]:
            usfm_markers[-1].append(tok.marker)
            all_markers.add(tok.marker.strip("+*"))

with src_out.open("w", encoding=settings.encoding) as f:
    for ref, markers in zip(vrefs, usfm_markers):
        f.write(f"{ref} {markers}\n")

chapter_totals = [0]
curr_chapter = 1
for ref, markers in zip(vrefs, usfm_markers):
    # add an entry for every chapter up to this one, so a skipped chapter gets a 0
    while ref.chapter_num > curr_chapter:
        chapter_totals.append(0)
        curr_chapter += 1
    chapter_totals[-1] += len(markers)
with Path("marker_counts_src.txt").open("w", encoding=settings.encoding) as f:
    f.write(f"{chapter_totals}\n")
    for ref, markers in zip(vrefs, usfm_markers):
        f.write(f"{ref} {len(markers)}\n")


# file 2
settings = FileParatextProjectSettingsParser(trg_proj_path).parse()
file_text = UsfmFileText(
    settings.stylesheet,
    settings.encoding,
    book,
    trg_book_path,
    include_markers=True,
    include_all_text=True,
    project=settings.name,
)

vrefs = []
usfm_markers = []
usfm_tokenizer = UsfmTokenizer(settings.stylesheet)
for sent in file_text:
    if len(sent.ref.path) > 0 and sent.ref.path[-1].name == "rem":
        continue

    vrefs.append(sent.ref)
    usfm_markers.append([])
    usfm_toks = usfm_tokenizer.tokenize(sent.text.strip())
    
    ignore_scope = None
    for j, tok in enumerate(usfm_toks):
        if ignore_scope is not None:
            if tok.type == UsfmTokenType.END and tok.marker[:-1] == ignore_scope.marker:
                ignore_scope = None
        elif tok.type == UsfmTokenType.NOTE or (tok.type == UsfmTokenType.CHARACTER and tok.marker in to_delete):
            ignore_scope = tok
        elif tok.type in [UsfmTokenType.PARAGRAPH, UsfmTokenType.CHARACTER, UsfmTokenType.END]:
            usfm_markers[-1].append(tok.marker)
            all_markers.add(tok.marker.strip("+*"))

with trg_out.open("w", encoding=settings.encoding) as f:
    for ref, markers in zip(vrefs, usfm_markers):
        f.write(f"{ref} {markers}\n")

chapter_totals = [0]
curr_chapter = 1
for ref, markers in zip(vrefs, usfm_markers):
    # add an entry for every chapter up to this one, so a skipped chapter gets a 0
    while ref.chapter_num > curr_chapter:
        chapter_totals.append(0)
        curr_chapter += 1
    chapter_totals[-1] += len(markers)
with Path("marker_counts_trg.txt").open("w", encoding=settings.encoding) as f:
    f.write(f"{chapter_totals}\n")
    for ref, markers in zip(vrefs, usfm_markers):
        f.write(f"{ref} {len(markers)}\n")

print(all_markers)

### Running and evaluating marker placement

`silnlp.common.postprocess_draft` can be used with the `--source` and `--draft` (and `--book`) options to run marker placement. Even though the draft file already has correctly placed markers in it, only the text of the file is used.

To evaluate the quality of the marker placement, use `silnlp.common.compare_usfm_structure`, where `gold` is the "draft" with the correctly placed markers and `pred` is the file output by the `postprocess_draft` script. To evaluate only the chapters that have corrected marker placements, remove every other chapter from both files (script below).

Cut out chapters

In [None]:
from pathlib import Path

file_path = Path("")
out_path = Path("")
chapters = [] # chapters to KEEP
lines = []
skip = True
encoding = "utf-8-sig" # utf-8-sig cp1252
with file_path.open(encoding=encoding) as f:
    for line in f:
        # match only the chapter marker itself, not \ca, \cl, \cp or \cd
        if line.startswith("\\c "):
            skip = int(line.split()[1]) not in chapters
        if not skip or line.startswith("\\id"):
            lines.append(line)

with out_path.open("w", encoding=encoding) as f:
    f.writelines(lines)
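
A quick sanity check after cutting is to list the chapters that remain in the output. This is a minimal sketch that works on a USFM string; `chapters_present` and the sample text are hypothetical, and in practice the string would be read from `out_path`.

```python
import re


def chapters_present(usfm: str) -> list[int]:
    """Return the chapter numbers of all \\c lines in a USFM string, in file order."""
    return [int(n) for n in re.findall(r"^\\c\s+(\d+)", usfm, flags=re.MULTILINE)]


# Hypothetical trimmed file that kept chapters 2 and 4
trimmed = "\\id HEB\n\\c 2\n\\p\n\\v 1 text\n\\c 4\n\\p\n\\v 1 text\n"
expected = [2, 4]  # the same list that was passed as `chapters` when cutting
print(chapters_present(trimmed) == expected)
```

Running the same check on both the gold and predicted files confirms they cover the same chapters before they are compared.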