# CER tests

Tests of character error rate (CER) computation, with the goal of evaluating parts of documents that were generated with handwritten text recognition (HTR). For this purpose we use the [fastwer Python package](https://pypi.org/project/fastwer/), which can compute word error rates and character error rates for similar texts.
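
As a rough illustration of what a character error rate measures, the sketch below computes it in plain Python as the Levenshtein distance divided by the length of the reference text, in percent. This is the common definition and only an assumption about what `fastwer.score_sent(..., char_level=True)` returns; the helper names `edit_distance` and `character_error_rate` are hypothetical and not part of fastwer.

```python
def edit_distance(hypothesis, reference):
    """Levenshtein distance: minimal number of inserts, deletes, substitutions."""
    previous = list(range(len(reference) + 1))
    for i, hyp_char in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_char in enumerate(reference, start=1):
            current.append(min(previous[j] + 1,        # drop hyp_char
                               current[j - 1] + 1,     # add ref_char
                               previous[j - 1] + (hyp_char != ref_char)))
        previous = current
    return previous[-1]


def character_error_rate(hypothesis, reference):
    """CER in percent, normalized by the reference length (assumed convention)."""
    if not reference:
        return 0.0 if not hypothesis else 100.0
    return 100.0 * edit_distance(hypothesis, reference) / len(reference)


# "y" instead of "ij" costs two edits on a reference of 18 characters
print(round(character_error_rate("burgerlyken stand", "burgerlijken stand"), 1))
```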

## 1. Initialization

In [None]:
# %load_ext pycodestyle_magic
# %pycodestyle_on

In [None]:
from collections import Counter
import copy
import fastwer
import os
import pandas as pd
import regex
from spacy import displacy
import sys
from termcolor import colored
sys.path.append(os.getcwd() + '/..')
from scripts import read_transkribus_files, printed_text, utils

In [None]:
def render_text(text, entities):
    """Display text with colored entities.

    Keyword arguments:
    text -- string
    entities -- list of dicts like: [{"start": 0, "end": 6, "label": "PERSON"}]
    """
    displacy.render({"text": regex.sub("\\n", " ", text),
                    "ents": entities},
                    options={"colors": {"fuzzy_match": "yellow"}},
                    style="ent", manual=True)

In [None]:
# 20240326 new htr model tests
# data_dir_gold = "tmp/1900167/Test_oldDutchess_(corrected)/page"
# data_dir_htr = {"bestmodel": "tmp/1900168/Test_oldDutchess_(corrected)/page",
#                "dutchess": "tmp/1893490/Test_oldDutchess/page",
#                "new_dutchess": "tmp/1893491/Test_newDutchess/page"}

Training data for model 42578 Curacao_Dutchess (278 files):

* Training_extra_0001 - Training_extra_0011 (11 files)
* p001 - p101 (skipped p010, p020, *p021*, p030, *p036*, p040, *p044*, p050, p060, p070, p080, p090, p100) (88 files)
* Sample_regex-0001 - Sample_regex-0100 (skipped: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100) (90 files)
* p055 - p100 (skipped 55, 62, 71, 85, 90) 
* p001 - p054 (skipped 5, 12, 18, 31, 43, 51) - 54: 25-08-1880 577 (89 files)

Validation data (35 files):

- Validation_extra-0001 - Validation_extra-0005 (skipped 3) (4 files)
- p010 - p100 (10 files)
- Sample_regex-0010 - Sample_regex-0100 (10 files)
- p055, 62, 71, 85, 90
- p005, 12, 18, 31, 43, 51 (11 files)

In [None]:
# missing: 7213/72dpi (empty)

data_files = [
    ["10946395", "2483583", "CurTSSR3"],
    ["10946397", "2471264", "ground_truth"],
    ["11035298", "2493743", "CurTSSR7"],
    ["11035339", "2531603", "CurTSSR13"],
    ["11035356", "2535323", "CurTSSR15"],
    ["11035397", "2488423", "CurTSSR4"],
    ["11035399", "2492783", "CurTSSR6"],
    ["11035403", "2530883", "CurTSSR12"],
    ["11035557", "2549503", "CurTSSR16"],
    ["11035560", "2600443", "CurTSSR27"],
    ["11035562", "2619783", "CurTSSR5_NB"],
    ["11035576", "2550123", "CurTSSR17"],
    ["11035594", "2568143", "CurTSSR20"],
    ["11035653", "2546503", "CurTSSR18"],
    ["11035655", "2568883", "CurTSSR21"],
    ["11035656", "2584923", "CurTSSR23"],
    ["11035673", "2599743", "CurTSSR25"],
    ["11035675", "2623583", "CurTSSR15_NB"],
    ["11035676", "2637703", "CurTSSR25_NB"],
    ["11035692", "2600203", "CurTSSR26"],
    ["11035715", "2637163", "CurTSSR20_NB"],
    ["11035754", "2622863", "CurTSSR10_NB"],
    ["11036995", "3194127", "144dpi"],
    ["11037173", "3194114", "35dpi"],
    ["11037174", "3194504", "300dpi"],
    ["11037182", "2471264_dpi", "dpi_ground_truth"],
]

In [None]:
data_files_df = pd.DataFrame(data_files, columns=["export_id", "experiment_id", "experiment_name"])

In [None]:
data_dir_gold = "202407/2471264_dpi/Training_set_small/page"
data_dir_htr = {}
for index, row in data_files_df.iterrows():
    if regex.search('[0-9]dpi', row["experiment_name"]):
        # data_dir_htr[row["experiment_name"]] = f"202407/{row['experiment_id']}/TRAINING_VALIDATION_SET_{row['experiment_name']}/page"
        data_dir_htr[row["experiment_name"]] = regex.sub("dpi", "DPI", f"202407/{row['experiment_id']}/Cur_Scanqualtest_{row['experiment_name']}/page")

In [None]:
data_dir_htr

## 2. Read files

In [None]:
def read_files(data_dir_gold, data_dir_htr):
    """Read gold and htr data and return results as dicts."""
    (texts_gold,
     metadata_gold,
     textregions_gold) = read_transkribus_files.read_files(data_dir_gold)
    (texts_htr,
     metadata_htr,
     textregions_htr) = read_transkribus_files.read_files(data_dir_htr)
    for file_name in texts_htr:
        texts_htr[file_name] = "".join(texts_htr[file_name])
    for file_name in texts_gold:
        texts_gold[file_name] = "".join(texts_gold[file_name])
    return texts_gold, texts_htr

In [None]:
def word_count_text_list(texts):
    """Return number of tokens in text."""
    return len([token for text_id in texts.keys()
                for token in texts[text_id].split()
                if regex.search("[a-zA-Z0-9]", token)])

In [None]:
texts_gold, texts_htr = read_files(data_dir_gold, data_dir_htr["CurTSSR26"])

In [None]:
word_count_text_list(texts_gold)

In [None]:
word_count_text_list(texts_htr)

## 3. Compute character error rates (CER) per text

The character "y" (which should be written as "ij") and accented characters (whose accents should be removed) are forbidden in the text annotations, so we replace them before checking the texts. We also remove punctuation because we do not want it to count towards the CER.
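
The effect of this normalization on a single string can be sketched as follows. This is a compact stand-in built on `str.translate`, not the notebook's own function; the names `ACCENT_MAP`, `PUNCTUATION` and `normalize_text` are hypothetical, and the example sentence is made up.

```python
# accented characters and their plain replacements (same set as handled in this notebook)
ACCENT_MAP = str.maketrans({"ä": "a", "à": "a", "å": "a", "ç": "c",
                            "é": "e", "è": "e", "ë": "e", "ñ": "n",
                            "ó": "o", "ö": "o", "ø": "o",
                            "ü": "u", "ú": "u"})
PUNCTUATION = "?.,!;():'\"-[]"


def normalize_text(text):
    """Replace "y" by "ij", strip accents and drop punctuation."""
    text = text.replace("y", "ij").translate(ACCENT_MAP)
    return "".join(char for char in text if char not in PUNCTUATION)


print(normalize_text("Heyliger, geboren te Curaçao."))
```

Applying the same normalization to both the gold text and the HTR output keeps spelling conventions and punctuation from being counted as recognition errors.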

In [None]:
def check_text_for_illegal_characters(text):
    """Print lines that contain characters outside the allowed set."""
    for line in text.split("\n"):
        if regex.search("[^a-xzA-Z0-9/ ]", line):
            print(f"illegal character found on line: {line}")

In [None]:
def remove_illegal_characters(text):
    """Replace known non-ascii characters by recommended alternatives."""
    text = regex.sub("y", "ij", text)
    text = regex.sub("ä", "a", text)
    text = regex.sub("à", "a", text)
    text = regex.sub("å", "a", text)
    text = regex.sub("ç", "c", text)
    text = regex.sub("é", "e", text)
    text = regex.sub("è", "e", text)
    text = regex.sub("ë", "e", text)
    text = regex.sub("ñ", "n", text)
    text = regex.sub("ó", "o", text)
    text = regex.sub("ö", "o", text)
    text = regex.sub("ø", "o", text)
    text = regex.sub("ü", "u", text)
    text = regex.sub("ú", "u", text)
    text = regex.sub("[?.,!;():'" + '"-]', "", text)
    text = regex.sub("\\[", "", text)
    text = regex.sub("\\]", "", text)
    return text

In [None]:
def remove_illegal_characters_and_check(texts):
    """Remove non-ascii characters and check texts for other characters."""
    for texts_index in texts:
        texts[texts_index] = remove_illegal_characters(texts[texts_index])
        check_text_for_illegal_characters(texts[texts_index])
    return texts

In [None]:
def make_old_cer_string(cer, old_cer_dict, file_name, guessed_years):
    """Build a string with the guessed year and the change in CER."""
    old_cer_string = ""
    if file_name in guessed_years:
        old_cer_string = (str(guessed_years[file_name]) +
                          check_guessed_year_order(guessed_years, file_name) +
                          " ")
    if file_name in old_cer_dict and old_cer_dict[file_name] != cer:
        cer_diff = round(cer - old_cer_dict[file_name], 1)
        cer_sign = "+" if cer_diff > 0 else ""
        old_cer_string += f"({cer_sign}{cer_diff})"
    return old_cer_string

In [None]:
def check_guessed_year_order(guessed_years, file_name_target):
    """Return "!" if the guessed year is before that of the preceding file."""
    last_year = 0
    for file_name in sorted(guessed_years.keys()):
        if file_name == file_name_target:
            if last_year <= guessed_years[file_name_target]:
                return " "
            else:
                return "!"
        last_year = guessed_years[file_name]
    print("check_guessed_year_order: cannot happen")
    return " "

In [None]:
def compute_cer_per_text(texts_htr,
                         texts_gold,
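                         old_cer_dict=None,
                         guessed_years=None):
    """Compute character error rate per text for texts in input list."""
    # avoid mutable default arguments, which are shared between calls
    old_cer_dict = {} if old_cer_dict is None else old_cer_dict
    guessed_years = {} if guessed_years is None else guessed_years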
    total_cer = 0
    total_chars = 0
    cer_dict = {"average_cer": 100}
    texts_gold_clean = remove_illegal_characters_and_check(texts_gold)
    texts_htr_clean = remove_illegal_characters_and_check(texts_htr)
    for file_name in sorted(texts_htr_clean.keys()):
        cer = round(fastwer.score_sent(texts_htr_clean[file_name],
                                       texts_gold_clean[file_name],
                                       char_level=True), 1)
        max_text_length = max(len(texts_htr_clean[file_name]),
                              len(texts_gold_clean[file_name]))
        total_chars += max_text_length
        total_cer += max_text_length * cer
        old_cer_string = make_old_cer_string(cer,
                                             old_cer_dict,
                                             file_name,
                                             guessed_years)
        cer_dict[file_name] = cer
    cer_dict["average_cer"] = round(total_cer/total_chars, 1)
    old_cer_string = make_old_cer_string(cer_dict["average_cer"],
                                         old_cer_dict,
                                         "average_cer",
                                         guessed_years)
    return cer_dict
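
The average CER computed above is weighted by text length, so long files influence the result more than short ones. A minimal sketch of that weighting, with made-up numbers and a hypothetical helper `weighted_average_cer` (the notebook uses the longer of the HTR and gold text as the length):

```python
def weighted_average_cer(per_file):
    """per_file: list of (cer_percent, text_length) pairs."""
    total_chars = sum(length for _, length in per_file)
    if total_chars == 0:
        return 0.0
    total_cer = sum(cer * length for cer, length in per_file)
    return round(total_cer / total_chars, 1)


scores = [(10.0, 100), (20.0, 300)]                  # made-up example values
print(weighted_average_cer(scores))                  # length-weighted average
print(sum(cer for cer, _ in scores) / len(scores))   # plain mean, for comparison
```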

In [None]:
results_dict = {}
for experiment_name in sorted(data_dir_htr.keys(),
                              key=lambda x: int(regex.sub("dpi", "",
                                                regex.sub("_NB", "",
                                                regex.sub("CurTSSR", "", x))))):
    texts_gold, texts_htr = read_files(data_dir_gold,
                                       data_dir_htr[experiment_name])
    cer_dict = compute_cer_per_text(texts_htr, texts_gold)
    print(f"{experiment_name:>9}: cer = {cer_dict['average_cer']}")
    results_dict[experiment_name] = cer_dict.copy()

In [None]:
pd.DataFrame(results_dict).T.to_csv("cers_per_experiment.csv")

In [None]:
print("Character error rates for Loghi region tests:")
pd.DataFrame([{"dataset": "CBAD",
               "order": [0, 1, 2, 3, 6, 7, 5, 4],
               "#order": 8,
               "cer": 59},
              {"dataset": "general",
               "order": [0, 6, 1, 2, 3, 5, 4],
               "#order": 7,
               "cer": 22},
              {"dataset": "republic",
               "order": [0, 10, 1, 2, 9, 3, 4, 5, 8, 6, 7],
               "#order": 11,
               "cer": 38},
              {"dataset": "republicprint",
               "order": [0, 32, 27, 1, 23, 12, 7, 16, 10, 2, 3, 4, 5, 6, 8, 9,
                         11, 13, 14, 15, 17, 18, 19, 20, 21, 24, 25, 26, 28,
                         29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41],
               "#order": 42,
               "cer": 62}])

## 4. Correct printed text

In [None]:
def check_century(tokens, tokens_index, text_id):
    """ check if the current token is part of a century; return the value """
    centuries = {"zeventienhonderd": 1700,
                 "achttienhonderd": 1800,
                 "negentienhonderd": 1900}
    if tokens[tokens_index] == "honderd" and tokens_index > 0:
        if tokens[tokens_index - 1] in ["zeven", "zeventien"]:
            year = 1700
        elif tokens[tokens_index - 1] in ["acht", "achttien"]:
            year = 1800
        elif tokens[tokens_index - 1] in ["negen", "negentien"]:
            year = 1900
        else:
            print(f"unexpected token before \"honderd\" in file {text_id}: " +
                  f"{tokens[tokens_index - 1]}")
            year = 0
    elif tokens[tokens_index] in list(centuries.keys()):
        year = centuries[tokens[tokens_index]]
    else:
        year = 0
    return year

In [None]:
def guess_year_from_text(text, text_id):
    """ find all years in the text and return the first; otherwise return 0 """
    years = []
    numbers = {"een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5,
               "zes": 6, "zeven": 7, "acht": 8, "negen": 9, "tien": 10,
               "elf": 11, "twaalf": 12, "dertien": 13, "veertien": 14,
               "vijftien": 15, "zestien": 16, "zeventien": 17, "achttien": 18,
               "negentien": 19,
               "twintig": 20, "dertig": 30, "veertig": 40, "vijftig": 50,
               "zestig": 60, "zeventig": 70, "tachtig": 80, "negentig": 90}

    tokens = text.lower().split()
    for tokens_index in range(0, len(tokens)):
        year = check_century(tokens, tokens_index, text_id)
        if year > 0:
            if (tokens_index < len(tokens) - 1 and
               tokens[tokens_index + 1] in numbers):
                year += numbers[tokens[tokens_index + 1]]
            if (tokens_index < len(tokens) - 2 and
               tokens[tokens_index + 1] == "en" and
               tokens[tokens_index + 2] in numbers):
                year += numbers[tokens[tokens_index + 2]]
            if (tokens_index < len(tokens) - 3 and
               tokens[tokens_index + 2] == "en" and
               tokens[tokens_index + 3] in numbers):
                year += numbers[tokens[tokens_index + 3]]
            if (tokens_index < len(tokens) - 4 and
               tokens[tokens_index + 3] == "en" and
               tokens[tokens_index + 4] in numbers):
                year += numbers[tokens[tokens_index + 4]]
            years.append(year)
    if len(years) == 0:
        return 0
    elif len(years) == 1:
        return years[0]
    else:
        if years[0] >= years[1] - 1:
            return years[0]
        else:
            return years[1]
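
The idea behind the year guessing can be shown on a single phrase. The sketch below is a simplified stand-in, not the notebook's function: it only handles the split form "acht honderd", returns the first year it finds, and uses hypothetical names (`CENTURIES`, `NUMBERS`, `year_from_words`).

```python
CENTURIES = {"zeven": 1700, "zeventien": 1700,
             "acht": 1800, "achttien": 1800,
             "negen": 1900, "negentien": 1900}
NUMBERS = {"een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5,
           "zes": 6, "zeven": 7, "acht": 8, "negen": 9, "tien": 10,
           "twintig": 20, "dertig": 30, "veertig": 40, "vijftig": 50,
           "zestig": 60, "zeventig": 70, "tachtig": 80, "negentig": 90}


def year_from_words(text):
    """Return the first year spelled out around the token "honderd", else 0."""
    tokens = text.lower().split()
    for i, token in enumerate(tokens):
        if token == "honderd" and i > 0 and tokens[i - 1] in CENTURIES:
            year = CENTURIES[tokens[i - 1]]
            rest = tokens[i + 1:i + 4]
            if rest and rest[0] in NUMBERS:       # e.g. "vijf" or "tachtig"
                year += NUMBERS[rest[0]]
                rest = rest[1:]
            if len(rest) >= 2 and rest[0] == "en" and rest[1] in NUMBERS:
                year += NUMBERS[rest[1]]          # e.g. "en tachtig"
            return year
    return 0


print(year_from_words("des jaars een duizend acht honderd vijf en tachtig"))
```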

In [None]:
def guess_years(texts):
    """Guess the year of each text; return a dict keyed by file name."""
    return {file_name: guess_year_from_text(texts[file_name], file_name)
            for file_name in texts}

In [None]:
def get_printed_text_year(text_year):
    """
        find appropriate index of text format
        printed_text.PRINTED_TEXT for a certificate
    """
    printed_text_year = list(printed_text.PRINTED_TEXT.keys())[0]
    for year in sorted(printed_text.PRINTED_TEXT.keys()):
        if year > printed_text_year and text_year >= year:
            printed_text_year = year
    return printed_text_year

In [None]:
def same_number_of_words(phrase, search_text, positions):
    """
        check if the proposed replacement phrase has
        the same number of words as the original
    """
    guessed_phrase = search_text[positions[0].start(): positions[0].end()]
    return len(guessed_phrase.split()) == len(phrase.split())

In [None]:
def find_match(text, phrase, start=0, end=None, level=0, max_diff=3):
    """ find an approximately matching phrase in the text """
    match = {}
    if (phrase.lower() in SKIP_PHRASES_ALWAYS or
        (phrase.lower() in SKIP_PHRASES_AT_START and
         start == 0 and
         end is None)):
        return match
    search_text = text[start: end]
    if len(phrase) > 2 - level:
        positions = utils.find_text_patterns(r'\b' +
                                             phrase.lower() +
                                             r'\b', search_text.lower())
        if (len(positions) == 1 and
            (positions[0]["start"] == 0 or
             not regex.search("[a-z]",
             search_text[positions[0]["start"]-1].lower())) and
            (positions[0]["end"] == len(search_text) or
             not regex.search("[a-z]",
                              search_text[positions[0]["end"]].lower()))):
            positions[0]["label"] = phrase
            match = {"start": positions[0]["start"] + start,
                     "end": positions[0]["end"] + start,
                     "label": "match",
                     "correct_phrase": phrase}
        elif len(positions) == 0:
            character_errors = 0
            while len(positions) == 0 and character_errors <= max_diff:
                query = (r'\b' +
                         f"({phrase.lower()})" +
                         "{" +
                         f"e<={character_errors}"+"}" +
                         r'\b')
                positions = [match for match in
                             regex.finditer(query, search_text.lower())]
                character_errors += 1
            if (len(positions) == 1 and
               same_number_of_words(phrase, search_text, positions)):
                match = {"start": positions[0].start() + start,
                         "end": positions[0].end() + start,
                         "label": "fuzzy_match",
                         "correct_phrase": phrase}
                if (positions[0].group()[0] == " " or
                   positions[0].group()[0] == "\n"):
                    match["start"] += 1
                if (positions[0].group()[-1] == " " or
                   positions[0].group()[-1] == "\n"):
                    match["end"] -= 1
                if positions[0].groups()[0] == phrase:
                    match["label"] = "match"
    return match


SKIP_PHRASES_AT_START = ["des jaars een duizend acht honderd",
                         "op dit eiland", "op den"]
SKIP_PHRASES_ALWAYS = ["middags te", "middags", "een"]
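
`find_match` relies on the fuzzy matching syntax of the `regex` package (`(phrase){e<=n}`). To show the underlying idea without that dependency, the sketch below slides a window with the same number of words as the phrase over the text and keeps the window with the smallest edit distance. It is a plain-Python stand-in under that assumption, not the notebook's method: it returns the best window rather than requiring a unique match, and `edit_distance` and `fuzzy_find` are hypothetical names.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def fuzzy_find(text, phrase, max_errors=3):
    """Return (start, end, errors) of the closest word window, or None."""
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    word_count = len(phrase.split())
    best = None
    for i in range(len(spans) - word_count + 1):
        start, end = spans[i][0], spans[i + word_count - 1][1]
        errors = edit_distance(text[start:end].lower(), phrase.lower())
        if errors <= max_errors and (best is None or errors < best[2]):
            best = (start, end, errors)
    return best


text = "verscheen voor mij ambtenaar van den burgerlyken stand"
start, end, errors = fuzzy_find(text, "burgerlijken stand")
print(text[start:end], errors)
```

Restricting candidates to windows with the same word count plays the same role as `same_number_of_words` above: it prevents a phrase from being matched against a shorter or longer stretch of text.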

In [None]:
def find_phrases_in_text(text, phrases, max_diff=0):
    """ find phrases in text, only return unique matches """
    entities = []
    start = 0
    for phrase in phrases:
        if len(entities) > 0:
            start = get_min_char_pos(entities, len(entities))
        if len(phrase) < 6 or phrases.count(phrase) > 1:
            entities.append({})
        else:
            entities.append(find_match(text,
                                       phrase, start=start,
                                       max_diff=max_diff))
    return entities

In [None]:
def get_min_char_pos(entities, index):
    """ get the final position of the last preceding phrase with a match """
    for counter in range(index-1, -1, -1):
        if "end" in entities[counter]:
            return entities[counter]["end"] + 1
    return 0

In [None]:
def get_max_char_pos(entities, index):
    """ get the first position of the first next phrase with a match """
    for counter in range(index+1, len(entities)):
        if "start" in entities[counter]:
            return entities[counter]["start"]
    return None

In [None]:
def find_phrases_in_text_with_entities(text, phrases, entities, max_diff=3):
    """ find phrases in text, only return unique matches """
    if len(entities) != len(phrases):
        print("find_phrases_in_text_with_entities: list entities " +
              f"{len(entities)} should have same length as phrases " +
              f"{len(phrases)}")
    for i in range(0, len(phrases)):
        if len(entities[i]) == 0:
            start = get_min_char_pos(entities, i)
            end = get_max_char_pos(entities, i)
            if end is None:
                entities[i] = find_match(text,
                                         phrases[i],
                                         start=start,
                                         level=1)
            else:
                entities[i] = find_match(text,
                                         phrases[i],
                                         start=start,
                                         end=end,
                                         level=1,
                                         max_diff=max_diff)
    return entities

In [None]:
def printed_text_split_in_words(printed_text_in, entities_in):
    """ split expected phrase in word before searching word-by-word """
    printed_text_out = []
    entities_out = []
    for i in range(0, len(printed_text_in)):
        if (len(entities_in[i]) > 0 or
           len(printed_text_in[i].split()) == 1 or
           printed_text_in[i].lower() in NO_SPLIT_PHRASES):
            printed_text_out.append(printed_text_in[i])
            entities_out.append(entities_in[i])
        else:
            # ideally not the first word of a phrase should be linked
            # first but the longest
            for word in printed_text_in[i].split():
                printed_text_out.append(word)
                entities_out.append({})
    return printed_text_out, entities_out


NO_SPLIT_PHRASES = ["des jaars een duizend acht honderd",
                    "laatstelijk gewoond",
                    "niet te kunnen schrijven",
                    "op dit eiland",
                    "middags te",
                    "middags"]

In [None]:
def sanity_check_entities(entities, text_id):
    starts_seen = {}
    ends_seen = {}
    for entity in entities:
        if "start" in entity:
            if entity["start"] in starts_seen:
                utils.print_with_color("duplicate start: " +
                                       f"{entity['start']} for text_id " +
                                       f"{text_id}! {entity}\n")
            if entity["end"] in ends_seen:
                utils.print_with_color("duplicate end: " +
                                       f"{entity['end']} for text_id " +
                                       f"{text_id}! {entity}\n")
            starts_seen[entity["start"]] = True
            ends_seen[entity["end"]] = True

In [None]:
def update_entities(entities, entity_replaced):
    """ adjust start and end point of entities after replacing a text """
    delta = (len(entity_replaced["correct_phrase"]) -
             (entity_replaced["end"] -
              entity_replaced["start"]))
    for entity in entities:
        if "start" in entity and entity["start"] > entity_replaced["start"]:
            entity["start"] += delta
        if "end" in entity and entity["end"] >= entity_replaced["end"]:
            entity["end"] += delta
    return entities

In [None]:
def correct_text(text_in, entities):
    """ replace fuzzy matches in text by correct phrases """
    text_out = text_in
    for entity in reversed(entities):
        if "label" in entity and entity["label"] == "fuzzy_match":
            text_out = (text_out[:entity["start"]] +
                        entity["correct_phrase"] +
                        text_out[entity["end"]:])
            if len(entity["correct_phrase"]) != (entity["end"] -
                                                 entity["start"]):
                entities = update_entities(entities, entity)
    return text_out, entities
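
`correct_text` walks through the entities in reverse so that replacing a span never shifts the positions of spans that still have to be processed. A minimal sketch of that idea, with a hypothetical helper `apply_replacements` and a made-up example sentence:

```python
def apply_replacements(text, replacements):
    """Apply non-overlapping (start, end, new_text) replacements to text."""
    # process the rightmost span first: earlier offsets stay valid
    for start, end, new_text in sorted(replacements, reverse=True):
        text = text[:start] + new_text + text[end:]
    return text


text = "op den twede dag der mand"
replacements = [(7, 12, "tweede"), (21, 25, "maand")]
print(apply_replacements(text, replacements))
```

When the replacements are applied from left to right instead, every later offset has to be corrected by the length difference, which is what `update_entities` does for the stored entities.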

In [None]:
def check_printed_text_for_illegal_characters(printed_text):
    for year in printed_text.PRINTED_TEXT:
        text = "\n".join(printed_text.PRINTED_TEXT[year])
        check_text_for_illegal_characters(text)

In [None]:
def correct_text_list(texts):
    corrected_texts = {}
    text_entities = {}
    guessed_years = {}
    for text_id in sorted(texts.keys()):
        guessed_years[text_id] = guess_year_from_text(texts[text_id], text_id)
        printed_text_year = get_printed_text_year(guessed_years[text_id])
        printed_text_text = [remove_illegal_characters(phrase)
                             for phrase in
                             printed_text.PRINTED_TEXT[printed_text_year]]
        entities = find_phrases_in_text(texts[text_id],
                                        printed_text_text,
                                        max_diff=0)
        entities = find_phrases_in_text_with_entities(texts[text_id],
                                                      printed_text_text,
                                                      entities, max_diff=0)
        entities = find_phrases_in_text_with_entities(texts[text_id],
                                                      printed_text_text,
                                                      entities)
        (printed_text_text,
         entities) = printed_text_split_in_words(printed_text_text,
                                                 entities)
        entities = find_phrases_in_text_with_entities(texts[text_id],
                                                      printed_text_text,
                                                      entities)
        entities = find_phrases_in_text_with_entities(texts[text_id],
                                                      printed_text_text,
                                                      entities)
        sanity_check_entities(entities, text_id)
        text_entities[text_id] = entities
        (corrected_texts[text_id],
         entities) = correct_text(texts[text_id],
                                  copy.deepcopy(entities))
    return corrected_texts, guessed_years

In [None]:
def correct_guessed_years(guessed_years, data_dir_gold):
    if data_dir_gold.split("/")[2] == "Sample_test_1":
        file_years = pd.read_csv("Sample_test_1_years.csv")
        file_names = list(guessed_years.keys())
        for index in range(0, len(file_names)):
            if guessed_years[file_names[index]] != file_years.iloc[index][1]:
                print(f"changed year of {file_names[index]} from " +
                      f"{guessed_years[file_names[index]]} " +
                      f"to {file_years.iloc[index][1]}")
                guessed_years[file_names[index]] = file_years.iloc[index][1]
    return guessed_years

In [None]:
corrected_texts, guessed_years = correct_text_list(texts_htr)

In [None]:
guessed_years = correct_guessed_years(guessed_years, data_dir_gold)

In [None]:
cer_dict_corrected = compute_cer_per_text(corrected_texts,
                                          texts_gold,
                                          cer_dict,
                                          guessed_years)

The average CER after correcting the printed text (5.7%) was higher than before correction (5.5%). How can this have happened?

1. The gold data might be incorrect: yes, it contained y, ç, é, ë, ó (**fixed**)
2. The printed text targets might be incorrect: yes, they contained ç (**fixed**)
3. Punctuation was included in the CER computation (**fixed**)
4. The printed text "middags te" becomes "namiddags te" with written text and is corrected back (**fixed**)
5. These old validation test files are probably in the training data of the new model: not many printed text errors are left to be corrected
6. **Validation test file p010.xml is a birth certificate**
7. p040.xml: strike-through in printed text

## 5. Find incorrect words

In [None]:
def find_best_line_match(lines_htr,
                         lines_gold,
                         index_htr,
                         gold_index_used,
                         alignments):
    best_index_gold = -1
    best_cer = 100
    line_htr = lines_htr[index_htr]
    index_htr_old = -1
    for index_gold in range(0, len(lines_gold)):
        cer = fastwer.score_sent(
            remove_illegal_characters(line_htr),
            remove_illegal_characters(lines_gold[index_gold]),
            char_level=True)
        if cer < best_cer and (index_gold not in gold_index_used.keys() or
                               cer < gold_index_used[index_gold][0]):
            best_cer = cer
            best_index_gold = index_gold
            if index_gold in gold_index_used:
                index_htr_old = gold_index_used[index_gold][1]
            gold_index_used[index_gold] = [best_cer, index_htr]
            if index_htr_old >= 0:
                (best_index_gold_old,
                 best_cer_old,
                 gold_index_used,
                 alignments) = find_best_line_match(lines_htr,
                                                    lines_gold,
                                                    index_htr_old,
                                                    gold_index_used,
                                                    alignments)
                alignments[index_htr_old] = [best_index_gold_old, best_cer_old]
    alignments[index_htr] = [best_index_gold, best_cer]
    return best_index_gold, best_cer, gold_index_used, alignments

In [None]:
def align_lines(lines_htr, lines_gold):
    alignments = {}
    gold_index_used = {}
    for index_htr in range(0, len(lines_htr)):
        (best_index_gold,
         best_cer,
         gold_index_used,
         alignments) = find_best_line_match(lines_htr,
                                            lines_gold,
                                            index_htr,
                                            gold_index_used,
                                            alignments)
    alignments = convert_alignments(alignments)
    return alignments
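
The alignment above pairs each HTR line with the gold line that has the lowest CER. A much simplified greedy variant is sketched below. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure instead of fastwer, and unlike `find_best_line_match` it never reassigns a gold line that is already taken; the name `align_greedy` is hypothetical.

```python
from difflib import SequenceMatcher


def align_greedy(lines_htr, lines_gold):
    """Map each HTR line index to the most similar unused gold line index."""
    alignments = {}
    used = set()
    for index_htr, line_htr in enumerate(lines_htr):
        best_index, best_score = None, 0.0
        for index_gold, line_gold in enumerate(lines_gold):
            if index_gold in used:
                continue
            score = SequenceMatcher(None, line_htr.lower(),
                                    line_gold.lower()).ratio()
            if score > best_score:
                best_index, best_score = index_gold, score
        if best_index is not None:
            alignments[index_htr] = best_index
            used.add(best_index)
    return alignments


lines_htr = ["op heden den", "verscheen voor mij"]
lines_gold = ["Op heden den", "extra regel", "verscheen voor mij"]
print(align_greedy(lines_htr, lines_gold))
```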

In [None]:
def count_distances(alignments):
    return Counter([alignment[1] - alignment[0] for alignment in alignments])

In [None]:
def check_alignments_order(alignments, lines_htr, lines_gold):
    last_gold_index = -1
    to_be_deleted = []
    distances = count_distances(alignments)
    for alignment_index in range(0, len(alignments)):
        if last_gold_index < alignments[alignment_index][1]:
            last_gold_index = alignments[alignment_index][1]
        elif (alignments[alignment_index][2] >
              alignments[alignment_index - 1][2]):
            to_be_deleted.append(alignment_index)
        elif (alignments[alignment_index][2] <
              alignments[alignment_index - 1][2]):
            to_be_deleted.append(alignment_index - 1)
            last_gold_index = alignments[alignment_index][1]
        elif (distances[alignments[alignment_index][1] -
              alignments[alignment_index][0]] <
              distances[alignments[alignment_index - 1][1] -
              alignments[alignment_index - 1][0]]):
            to_be_deleted.append(alignment_index)
        else:
            to_be_deleted.append(alignment_index - 1)
            last_gold_index = alignments[alignment_index][1]
    for to_be_deleted_value in list(reversed(to_be_deleted)):
        alignments.pop(to_be_deleted_value)
    to_be_added = []
    for alignment_index in range(1, len(alignments)):
        if ((alignments[alignment_index][0] -
             alignments[alignment_index - 1][0] ==
             alignments[alignment_index][1] -
             alignments[alignment_index - 1][1]) and
            alignments[alignment_index][0] -
            alignments[alignment_index - 1][0] != 1):
            range_end = (alignments[alignment_index][0] -
                         alignments[alignment_index - 1][0])
            for alignment_index_delta in range(1, range_end):
                to_be_added.append((alignment_index,
                                    alignments[alignment_index - 1][0] +
                                    alignment_index_delta,
                                    alignments[alignment_index - 1][1] +
                                    alignment_index_delta))
    for to_be_added_element in list(reversed(to_be_added)):
        e1clean = remove_illegal_characters(lines_htr[to_be_added_element[1]])
        e2clean = remove_illegal_characters(lines_gold[to_be_added_element[2]])
        alignments.insert(to_be_added_element[0],
                          (to_be_added_element[1],
                           to_be_added_element[2],
                           fastwer.score_sent(
                           e1clean,
                           e2clean,
                           char_level=True)))
    return len(to_be_deleted) > 0 or len(to_be_added) > 0, alignments

In [None]:
def check_alignments_order_wrapper(alignments, lines_htr, lines_gold):
    alignments_changed = True
    while alignments_changed:
        alignments_changed, alignments = check_alignments_order(alignments,
                                                                lines_htr,
                                                                lines_gold)
    return alignments

In [None]:
def fix_split_words(words_htr, words_gold, wrong_words, missed_words):
    to_be_deleted = []
    for index_wrong in range(1, len(wrong_words)):
        if wrong_words[index_wrong] == wrong_words[index_wrong - 1] + 1:
            combined_word = (words_htr[wrong_words[index_wrong - 1]] +
                             words_htr[wrong_words[index_wrong]])
            for index_missed in range(0, len(missed_words)):
                if words_gold[missed_words[index_missed]] == combined_word:
                    to_be_deleted.append((index_wrong - 1,
                                          index_wrong,
                                          index_missed))
                    break
    for to_be_deleted_item in list(reversed(to_be_deleted)):
        for to_be_deleted_wrong_index in range(to_be_deleted_item[1],
                                               to_be_deleted_item[0] - 1,
                                               -1):
            wrong_words.pop(to_be_deleted_wrong_index)
        missed_words.pop(to_be_deleted_item[2])
    return wrong_words, missed_words
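
The idea behind `fix_split_words` can be illustrated with a much smaller stand-in. The sketch below is not the function above: `find_split_words` is a hypothetical helper that only reports where two adjacent HTR words concatenate to a single gold word, which is the situation the function above treats as one recognized word rather than two wrong words and one missed word.

```python
def find_split_words(words_htr, words_gold):
    """Return (htr_index, gold_index) pairs for split words.

    A pair is reported when HTR words i and i + 1, joined without a
    space, are equal to a word in the gold list (hypothetical helper).
    """
    pairs = []
    for index_htr in range(len(words_htr) - 1):
        combined_word = words_htr[index_htr] + words_htr[index_htr + 1]
        if combined_word in words_gold:
            pairs.append((index_htr, words_gold.index(combined_word)))
    return pairs


# Made-up example line: the HTR output splits "kleermaker" in two
words_htr = "jan de kleer maker".split()
words_gold = "jan de kleermaker".split()
split_pairs = find_split_words(words_htr, words_gold)
print(split_pairs)
```
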

In [None]:
def analyze_words(line_htr, line_gold, lines_htr, lines_gold):
    missed_words = []
    wrong_words = []
    if line_htr.lower() != line_gold.lower():
        words_htr = line_htr.lower().split()
        words_gold = line_gold.lower().split()
        alignments = align_lines(words_htr, words_gold)
        alignments = check_alignments_order_wrapper(alignments,
                                                    words_htr,
                                                    words_gold)
        index_htr = 0
        index_gold = 0
        for index_alignment in range(0, len(alignments)):
            target_index_htr = alignments[index_alignment][0]
            target_index_gold = alignments[index_alignment][1]
            while (index_htr < len(words_htr) and
                   index_htr < target_index_htr):
                wrong_words.append(index_htr)
                index_htr += 1
            while (index_gold < len(words_gold) and
                   index_gold < target_index_gold):
                missed_words.append(index_gold)
                index_gold += 1
            if words_htr[target_index_htr] != words_gold[target_index_gold]:
                missed_words.append(target_index_gold)
                wrong_words.append(index_htr)
            index_htr += 1
            index_gold += 1
        for index_htr_extra in range(index_htr, len(words_htr)):
            wrong_words.append(index_htr_extra)
        for index_gold_extra in range(index_gold, len(words_gold)):
            missed_words.append(index_gold_extra)
        wrong_words, missed_words = fix_split_words(words_htr,
                                                    words_gold,
                                                    wrong_words,
                                                    missed_words)
    return wrong_words, missed_words

In [None]:
def analyze_lines(lines_htr, lines_gold, alignments):
    index_htr = 0
    line_analysis = []
    for alignment in alignments:
        # Unaligned HTR lines before this alignment get an empty gold line,
        # so that line_analysis has exactly one entry per HTR line
        for index_htr_extra in range(index_htr, alignment[0]):
            line_analysis.append(analyze_words(lines_htr[index_htr_extra],
                                               "",
                                               lines_htr,
                                               lines_gold))
        line_analysis.append(analyze_words(lines_htr[alignment[0]],
                                           lines_gold[alignment[1]],
                                           lines_htr, lines_gold))
        index_htr = alignment[0] + 1
    # Remaining HTR lines after the last alignment
    for index_htr_extra in range(index_htr, len(lines_htr)):
        line_analysis.append(analyze_words(lines_htr[index_htr_extra],
                                           "",
                                           lines_htr,
                                           lines_gold))
    return line_analysis

In [None]:
def show_word_analysis(line_htr, line_gold, line_analysis_line):
    words_htr = line_htr.split()
    words_gold = line_gold.split()
    for index_htr in range(0, len(words_htr)):
        if index_htr in line_analysis_line[0]:
            read_transkribus_files.print_with_color(words_htr[index_htr],
                                                    color_code=1,
                                                    end=" ")
        else:
            read_transkribus_files.print_with_color(words_htr[index_htr],
                                                    color_code=0,
                                                    end=" ")
    if len(line_analysis_line[1]) > 0:
        read_transkribus_files.print_with_color([words_gold[index_gold]
                                                for index_gold in
                                                line_analysis_line[1]],
                                                color_code=4,
                                                end=" ")
    print()

In [None]:
def show_line_analysis(lines_htr, lines_gold, alignments, line_analysis):
    index_htr = 0
    for alignment in alignments:
        for index_htr_extra in range(index_htr, alignment[0]):
            show_word_analysis(lines_htr[index_htr_extra], "", [[], []])
        show_word_analysis(lines_htr[alignment[0]],
                           lines_gold[alignment[1]],
                           line_analysis[alignment[0]])
        index_htr = alignment[0] + 1
    for index_htr_extra in range(index_htr, len(lines_htr)):
        show_word_analysis(lines_htr[index_htr_extra], "", [[], []])

In [None]:
def convert_alignments(alignments):
    alignments_converted = []
    for alignments_key in alignments.keys():
        alignments_converted.append((alignments_key,
                                     alignments[alignments_key][0],
                                     alignments[alignments_key][1]))
    return alignments_converted

In [None]:
def compare_text(texts_htr, texts_gold, file_name):
    lines_htr = remove_illegal_characters(texts_htr[file_name]).split("\n")
    lines_gold = remove_illegal_characters(texts_gold[file_name]).split("\n")
    alignments = align_lines(lines_htr, lines_gold)
    cer = fastwer.score_sent("\n".join(lines_htr),
                             "\n".join(lines_gold),
                             char_level=True)
    read_transkribus_files.print_with_color(f"{file_name} (cer={cer:.1f}):\n",
                                            color_code=4)
    line_analysis = analyze_lines(lines_htr, lines_gold, alignments)
    show_line_analysis(lines_htr, lines_gold, alignments, line_analysis)

In [None]:
def compare_texts(texts_htr, texts_gold):
    for file_name in sorted(texts_htr.keys()):
        compare_text(texts_htr, texts_gold, file_name)

In [None]:
for file_id in texts_htr.keys():
    compare_text(texts_htr, texts_gold, file_id)

In [None]:
selected_file = "0043_p037.xml"

In [None]:
compare_text(corrected_texts, texts_gold, selected_file)

In [None]:
compare_text(corrected_texts, texts_htr, selected_file)

In [None]:
print(selected_file)
texts_htr[selected_file] == corrected_texts[selected_file]

In [None]:
corrected_texts[selected_file]

In [None]:
texts_htr[selected_file]

## 6. Named entity analysis

Install the Dutch spaCy model `nl_core_news_sm` with: `python -m spacy download nl_core_news_sm`
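
Because spaCy models are installed as regular Python packages, it is possible to check for the model before trying to load it. The sketch below uses only the standard library; `model_is_installed` is a hypothetical helper name, not part of spaCy.

```python
import importlib.util


def model_is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


if not model_is_installed("nl_core_news_sm"):
    print("Model missing; run: python -m spacy download nl_core_news_sm")
```
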

In [None]:
import spacy

In [None]:
nlp = spacy.load("nl_core_news_sm")

In [None]:
def get_token_positions(tokens):
    return [{"start": token.idx,
             "end": token.idx + len(token)}
            for token in tokens]

In [None]:
def get_spacy_entities(analysis):
    token_positions = get_token_positions([token for token in analysis])
    return [{"start": token_positions[entity.start]["start"],
             "end": token_positions[entity.end - 1]["end"],
             "label": str(entity.label_)}
            for entity in analysis.ents]

In [None]:
def select_entities(entities_in, entities_lower_cased, text):
    entities_out = [entity for entity in entities_in
                    if entity["label"] == "PERSON"]
    to_be_deleted = []
    for index in range(0, len(entities_out)):
        for entity in entities_lower_cased:
            if (entities_out[index]["start"] == entity["start"] and
                entities_out[index]["end"] == entity["end"]):
                to_be_deleted.append(index)
                break
    for index in range(len(to_be_deleted) - 1, -1, -1):
        entities_out.pop(to_be_deleted[index])
    return entities_out

In [None]:
def get_previous_token(text, position):
    index = position
    while index > 0 and regex.search("\\s", text[index - 1]):
        index -= 1
    previous_token = ""
    while index > 0 and regex.search("\\S", text[index - 1]):
        index -= 1
        previous_token = text[index] + previous_token
    return previous_token, index

In [None]:
def get_next_token(text, position):
    index = position
    while (index < len(text) and
           (regex.search("\\s", text[index]) or
            text[index] == ",")):
        index += 1
    next_token = ""
    while index < len(text) and regex.search("\\S", text[index]):
        next_token += text[index]
        index += 1
    return next_token, index

In [None]:
def expand_entities(entities, text):
    for entity in entities:
        previous_token, previous_start = get_previous_token(text,
                                                            entity["start"])
        while (regex.search("^[A-Z]", previous_token) and
               previous_token not in SKIP_TOKENS):
            entity["start"] = previous_start
            (previous_token,
             previous_start) = get_previous_token(text, entity["start"])
        next_token, next_end = get_next_token(text, entity["end"])
        while (regex.search("^[A-Z]", next_token) and
               next_token not in SKIP_TOKENS):
            entity["end"] = next_end
            next_token, next_end = get_next_token(text, entity["end"])
    return entities


SKIP_TOKENS = ["Oud", "En", "Een", "Twee", "Drie", "Vier", "Vijf", "Zes",
               "Zeven", "Acht", "Negen", "Tien", "in",
               "Ongehuwd", "Aanteekeningen.", "Aanteekeningen",
               "Verbeteringen.", "Hospitaal", "Compareerden", "De", "Sep",
               "No.", "Nr.", "Fol.", "Werk", "Heden", "Waarvan", "Jaars",
               "des", "January", "Januari", "February", "Februari",
               "Maart", "April", "Mei", "Juni", "Juli", "July", "Augustus",
               "September", "October", "November", "December",
               "Zeventienden",  "Zestienden", "Curaçao", "Curacao", "Habana",
               "Achthonderd", "Negenhonderd", "en",
               "Twintig", "Dertig", "Veertig", "Vijftig", "Zestig",
               "Zeventig", "Tachtig", "Negentig", "Honderd"]

In [None]:
def shrink_entities(entities, text):
    for entity in entities:
        final_token, next_end = get_previous_token(text, entity["end"])
        while (regex.search("^[a-z]", final_token) or
               final_token in SKIP_TOKENS or
               regex.search("[0-9]", final_token)):
            entity["end"] = next_end
            final_token, next_end = get_previous_token(text, entity["end"])
        first_token, next_start = get_next_token(text, entity["start"])
        while (regex.search("^[a-z]", first_token) or
               first_token in SKIP_TOKENS or
               regex.search("[0-9]", first_token)):
            entity["start"] = next_start
            first_token, next_start = get_next_token(text, entity["start"])
    return entities

In [None]:
def remove_empty_entities(entities):
    to_be_deleted = []
    for index in range(0, len(entities)):
        if entities[index]["end"] <= entities[index]["start"]:
            to_be_deleted.append(index)
    to_be_deleted.reverse()
    for index in to_be_deleted:
        entities.pop(index)
    return entities

In [None]:
def remove_duplicate_entities(entities):
    to_be_deleted = []
    for index in range(0, len(entities)):
        # Compare with earlier entities only, so that the first entity of
        # each group of near-duplicates is kept instead of all being removed
        if len([True
                for index2 in range(0, index)
                if (entities[index2]["start"] >=
                    entities[index]["start"] - 1 and
                    entities[index2]["start"] <=
                    entities[index]["start"] + 1 and
                    entities[index2]["end"] >=
                    entities[index]["end"] - 2 and
                    entities[index2]["end"] <=
                    entities[index]["end"] + 2)]) > 0:
            to_be_deleted.append(index)
    to_be_deleted.reverse()
    for index in to_be_deleted:
        entities.pop(index)
    for index in range(0, len(entities)):
        if len([True
                for index2 in range(0, len(entities))
                if (index2 != index and
                    entities[index2]["start"] <= entities[index]["start"] and
                    entities[index2]["end"] >= entities[index]["start"])]) > 0:
            print(f"warning: overlapping entities [1] {index}")
        elif len([True
                  for index2 in range(0, len(entities))
                  if (index2 != index and
                      entities[index2]["start"] <= entities[index]["end"] and
                      entities[index2]["end"] >= entities[index]["end"])]) > 0:
            print(f"warning: overlapping entities [2] {index}")
        elif len([True
                  for index2 in range(0, len(entities))
                  if (index2 != index and
                      entities[index2]["start"] >= entities[index]["start"] and
                      entities[index2]["end"] <= entities[index]["end"])]) > 0:
            print(f"warning: overlapping entities [3] {index}")
    return entities

In [None]:
def check_name_triggers(entities, text):
    for name_trigger_text in NAME_TRIGGER_TEXTS:
        result = regex.search(name_trigger_text, text, regex.IGNORECASE)
        if result is not None and result.span()[1] >= 0:
            next_token, next_end = get_next_token(text, result.span()[1])
            if regex.search("^[A-Z]", next_token):
                token_start = next_end - len(next_token)
                token_used = len([entity
                                  for entity in entities
                                  if entity["start"] == token_start]) > 0
                if not token_used:
                    entities.extend(expand_entities([{"start": token_start,
                                                      "end": next_end,
                                                      "label": "PERSON"}],
                                                    text))
    return entities


NAME_TRIGGER_TEXTS = ["De personen van", "overleden is:",
                      "alhier is overleden", "stand op Curaçao",
                      "stand op Curacao", "zoon van", "dochter van",
                      "gehuwd met"]

In [None]:
def sanity_check_entities(entities, text):
    entities = remove_empty_entities(entities)
    entities = check_name_triggers(entities, text)
    for name in NAMES:
        result = regex.search(name, text)
        if result is not None and result.span()[0] >= 0:
            token_start = result.span()[0]
            token_end = result.span()[1]
            token_used = len([entity
                              for entity in entities
                              if (entity["start"] == token_start or
                                  entity["end"] == token_end)]) > 0
            if not token_used:
                entities.extend(expand_entities([{"start": token_start,
                                                  "end": token_end,
                                                  "label": "PERSON"}],
                                                text))
    entities = remove_duplicate_entities(entities)
    return entities


NAMES = ["Hermanus", "Jacob", "Theodorus", "José", "Maria", "Eduard",
         "Cardino", "Anton", "Martino", "Modest", "Martina", "Johannes",
         "Pedro", "Adelaida", "Luis", "Lindoro", "Manuel"]

In [None]:
def evaluate_entities(texts_gold, texts_htr, entities_dict,
                      htr_model, debug=False):
    total_length = 0
    total_cer = 0
    found = 0
    correct = 0
    total = 0
    results = []
    for file_name in texts_gold.keys():
        for entity in entities_dict[file_name]:
            gold_phrase = texts_gold[file_name][entity["start"]:entity["end"]]
            match = find_match(texts_htr[file_name], gold_phrase, max_diff=3)
            if not match:
                cer = 100  # 16 / len(gold_phrase)
                results.append([htr_model, file_name, cer, gold_phrase, ""])
            else:
                gold_phrase = regex.sub("\n", " ", gold_phrase)
                htr_phrase = regex.sub("\n",
                                       " ",
                                       texts_htr[file_name][match["start"]:
                                                            match["end"]])
                cer = round(fastwer.score_sent(htr_phrase,
                                               gold_phrase,
                                               char_level=True), 1)
                found += 1
                if match["label"] == "match":
                    correct += 1
                    if debug:
                        print(f"CORRECT: {gold_phrase}")
                elif debug:
                    print(f"PARTIAL: {gold_phrase} # {htr_phrase}")
                results.append([htr_model, file_name, cer,
                                gold_phrase, htr_phrase])
            total_length += len(gold_phrase)
            total_cer += cer * len(gold_phrase)
            total += 1
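    average_cer = round(total_cer/total_length, 1)
    # "correct" counts every phrase for which find_match returned a match;
    # "partial" is the subset whose match label is not "match"
    print(f"correct: {int(100*found/total)}%; " +
          f"partial {int(100*(found-correct)/total)}%")
    print(f"average cer: {average_cer}%")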
    return average_cer, results
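
The average CER returned by `evaluate_entities` is weighted by the length of each gold phrase, and a phrase without a match counts with a CER of 100. The sketch below reproduces only that arithmetic on made-up values; `weighted_average_cer` is a hypothetical helper and the phrases and scores are illustrative, not measured results.

```python
def weighted_average_cer(scored_phrases):
    """Return the CER averaged over phrases, weighted by phrase length.

    scored_phrases -- list of (cer_percentage, gold_phrase) tuples
    """
    total_length = sum(len(phrase) for _, phrase in scored_phrases)
    if total_length == 0:
        return 0.0
    total_cer = sum(cer * len(phrase) for cer, phrase in scored_phrases)
    return round(total_cer / total_length, 1)


# Illustrative values: one perfect match, one missed phrase (CER 100)
# and one partial match
scored = [(0.0, "Maria"), (100, "Jacob"), (10.0, "Johannes Pedro")]
example_average = weighted_average_cer(scored)
print(example_average)
```

Longer phrases therefore influence the average more than short ones: here the 14-character phrase contributes 140 of the 640 weighted points, divided by 24 characters in total.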

In [None]:
def average_entities_length(entities_dict):
    total_length = 0
    nbr_of_entities = 0
    for file_name in entities_dict:
        for entity in entities_dict[file_name]:
            nbr_of_entities += 1
            total_length += entity["end"] - entity["start"]
    return round(total_length / nbr_of_entities, 1)

In [None]:
def sort_entities(entities):
    """Return the entities sorted by start position, then by end position."""
    return sorted(entities,
                  key=lambda entity: (entity["start"], entity["end"]))

In [None]:
def add_text(entities, text):
    for entity in entities:
        entity["text"] = text[entity["start"]:entity["end"]]
    return entities

In [None]:
def remove_entities_with_newlines(entities):
    return [entity
            for entity in entities
            if not regex.search(r"\n", entity["text"])]

In [None]:
def find_entities(texts_gold):
    entities_dict = {}
    for file_name in texts_gold.keys():
        text = texts_gold[file_name]
        analysis_lower_cased = nlp(text.lower())
        entities_lower_cased = get_spacy_entities(analysis_lower_cased)
        analysis = nlp(text)
        entities = get_spacy_entities(analysis)
        entities = select_entities(entities, entities_lower_cased, text)
        entities = expand_entities(entities, text)
        entities = shrink_entities(entities, text)
        entities = sanity_check_entities(entities, text)
        entities = add_text(entities, text)
        entities = remove_entities_with_newlines(entities)
        entities_dict[file_name] = entities
    return entities_dict

In [None]:
cers_per_name = []
cers_per_experiment = {}
for htr_model in sorted(data_dir_htr.keys(),
                        key=lambda x: int(regex.sub("_NB", "", regex.sub("CurTSSR", "", x)))):
    print(colored(f"{htr_model}", attrs=['bold']))
    texts_gold, texts_htr = read_files(data_dir_gold, data_dir_htr[htr_model])
    entities_dict = find_entities(texts_gold)
    average_cer, results_per_model = evaluate_entities(texts_gold,
                                                       texts_htr,
                                                       entities_dict,
                                                       htr_model)
    cers_per_name.extend(results_per_model)
    cers_per_experiment[htr_model] = average_cer
pd.DataFrame(cers_per_name, columns=["experiment_name",
                                     "file_name",
                                     "cer",
                                     "gold_phrase",
                                     "htr_phrase"]).to_csv("cers_per_name.csv", index=False)
pd.DataFrame(cers_per_experiment,
             index=[0]).T.to_csv("cers_names_per_experiment.csv")

**Evaluation of name accuracy (all person names including signatures):**

| System | Model |    Correct | Partially Correct |
| ------ | ----- | ---------- | ----------------- |
| Transkribus | HTR_Curacao_bestModel | 70% | 38% |
| Transkribus | Curacao_Dutchess      | 75% | 39% |
| Loghi       | general               | 55% | 38% |

In [None]:
average_entities_length(entities_dict)

In [None]:
render_text(texts_gold["0040_p033.xml"], entities_dict["0040_p033.xml"])

## 7. Find professions

In [None]:
def count_entities(entities_dict):
    entities_count = {}
    for file_name in entities_dict.keys():
        for entity in entities_dict[file_name]:
            if entity["label"] in entities_count:
                entities_count[entity["label"]] += 1
            else:
                entities_count[entity["label"]] = 1
    return entities_count

In [None]:
def find_professions(texts):
    entities = {}
    for file_name in texts.keys():
        entities[file_name] = []
        for profession in PROFESSIONS_LIST:
            for match in regex.finditer(profession,
                                        regex.sub("\n",
                                                  " ",
                                                  texts[file_name]),
                                        regex.IGNORECASE):
                entities[file_name].append({"start": match.start(),
                                            "end": match.end(),
                                            "label": profession})
    return entities


PROFESSIONS_LIST = ["aanspreker",  "agent van politie", "agent", "beroep geen",
                    "bode", "boekbinder", "broodbakster", "broodbak ster",
                    "chauffeur", "geemploijeerde", "geemployeerde",
                    "gemploijeerde", "goudsmid", "gouvernementsambtenaar",
                    "gouvernements ambtenaar", "handelaar", "hoedenmaakster",
                    "hoedenmaak ster", "hoedenvlechtster", "kantonrechter",
                    "kleermaker", "kleer maker", "koopman", "koopvrouw",
                    "kuiper", "kui per", "landbouwer", "magazijnsknecht",
                    "metselaar", "naaister", "onbekend van beroep",
                    "opzichter", "pontvoerder", "pottenbaakster",
                    "schoenmaker", "schoen maker", "schilder",
                    "schrijnwerker", "sjouwer", "stoker", "timmerman",
                    "veldarbeider", "veldwachter", "visscher", "waschvrouw",
                    "wasch vrouw", "werkman", "werk man", "werkvrouw",
                    "werk vrouw", "werktuigkundige", "zeeman",
                    "zonder beroep"]
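
To see what kind of output this search produces, the sketch below runs the same approach on a made-up sentence. It is an approximation under stated assumptions: it uses the standard library `re` module instead of `regex`, escapes the search terms with `re.escape`, and `find_terms` is a hypothetical name.

```python
import re


def find_terms(text, terms):
    """Return start, end and label for each case-insensitive term match."""
    # Replacing newlines by spaces keeps all character offsets unchanged
    flat_text = text.replace("\n", " ")
    return [{"start": match.start(), "end": match.end(), "label": term}
            for term in terms
            for match in re.finditer(re.escape(term),
                                     flat_text,
                                     re.IGNORECASE)]


# Made-up sample text, not taken from the data set
sample = "Jan, van beroep Kleermaker,\nen Piet, agent van politie"
found_terms = find_terms(sample, ["agent van politie", "agent", "kleermaker"])
for entity in found_terms:
    print(entity, sample[entity["start"]:entity["end"]])
```

Note that a short term such as "agent" also matches inside "agent van politie", so the same stretch of text can be reported under two labels.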

In [None]:
data_dir_htr.keys()

In [None]:
for htr_model in data_dir_htr.keys():
    print(colored(f"{htr_model}", attrs=['bold']))
    texts_gold, texts_htr = read_files(data_dir_gold, data_dir_htr[htr_model])
    entities_dict = find_professions(texts_gold)
    evaluate_entities(texts_gold, texts_htr, entities_dict, htr_model)

**Evaluation of profession accuracy:**

| System | Model |    Correct | Partially Correct |
| ------ | ----- | ---------- | ----------------- |
| Transkribus | HTR_Curacao_bestModel | 78% | 11% |
| Transkribus | Curacao_Dutchess      | 80% |  8% |
| Loghi       | general               | 52% | 25% |

In [None]:
average_entities_length(entities_dict)

In [None]:
for file_name in entities_dict:
    print(file_name, len(entities_dict[file_name]))

In [None]:
render_text(texts_gold["0040_p033.xml"], [])

In [None]:
for file_name in texts_gold.keys():
    text = texts_gold[file_name]
    for match in regex.finditer("beroep", text, regex.IGNORECASE):
        profession_start = match.span()[1]
        while (profession_start < len(text) and
               regex.search("\\s", text[profession_start])):
            profession_start += 1
        profession_end = profession_start
        while (profession_end < len(text) and
               regex.search("\\S", text[profession_end])):
            profession_end += 1
        profession = text[profession_start: profession_end].lower()
        if profession not in PROFESSIONS_LIST:
            print(profession)

In [None]:
for file_name in texts_gold.keys():
    if regex.search("kui", texts_gold[file_name], regex.IGNORECASE):
        print(file_name, texts_gold[file_name])

In [None]:
entities_dict