# Finding parallel text in the Mambai Language Manual with hunalign

- Inputs: `Mambai Language Manual parallel.docx`
- Outputs: parallel corpus in TSV format: `mambai_language_manual_parallel.tsv`

Requirements:

1. Set up the Python requirements: `python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt`
2. Set up hunalign, and adjust `HUNALIGN_BIN` to point at its binary
3. Run this notebook
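
Before running the alignment step, it can help to confirm that `HUNALIGN_BIN` really points at an executable. The helper below is a minimal sketch using only the standard library; `find_hunalign` is a hypothetical name and is not part of the notebook or of hunalign.

```python
import os
import shutil


def find_hunalign(path):
    """Return a usable path to the hunalign binary, or None if none is found.

    Hypothetical helper: checks the given path first, then falls back to PATH.
    """
    # Accept an explicit path that exists and is executable.
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return path
    # Otherwise look the binary name up on PATH (returns None when missing).
    return shutil.which(os.path.basename(path))
```

Calling `find_hunalign(HUNALIGN_BIN)` and checking for `None` gives an early, readable failure instead of a shell error in section 3.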


In [None]:
HUNALIGN_BIN = "../hunalign/src/hunalign/hunalign"

### 1. Extract sentences from the document (Mambai in bold, English in regular font)


In [44]:
from docx import Document


def extract_sentences(docx_path):
    mambai_sentences = []
    english_sentences = []

    doc = Document(docx_path)

    for paragraph in doc.paragraphs:
        # Temporary storage for constructing sentences
        mambai_temp = ""
        english_temp = ""

        for run in paragraph.runs:
            # An all-upper-case run is a section delimiter: add a <p> marker to both the Mambai and English lists
            if run.text.isupper():
                mambai_sentences.append("<p>")
                english_sentences.append("<p>")
                continue
            if run.bold:
                # Append text to Mambai sentence
                mambai_temp += run.text
            else:
                # Append text to English sentence
                english_temp += run.text

        # Add to respective lists if the sentence is not empty
        if mambai_temp:
            mambai_sentences.append(mambai_temp)
        if english_temp:
            english_sentences.append(english_temp)

    return mambai_sentences, english_sentences


file_path = "/Users/raphaelmerx/Downloads/Mambai/Mambai Language Manual parallel.docx"
mambai_sentences, english_sentences = extract_sentences(file_path)

print(
    f"Found a total of {len(mambai_sentences)} potential Mambai sentences and {len(english_sentences)} English sentences."
)

Found a total of 1125 potential Mambai sentences and 1146 English sentences.
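
The bold/regular split can be illustrated without a `.docx` file. The sketch below stands in for python-docx runs with plain `(text, is_bold)` tuples; `split_runs` is a hypothetical name used only for illustration. In python-docx, `run.bold` is tri-state (`True`, `False`, or `None` when inherited), so a plain truthiness test, as in the cell above, treats inherited formatting as not bold.

```python
def split_runs(runs):
    """Split (text, is_bold) pairs into a (mambai, english) pair of strings.

    Stand-in for iterating over paragraph.runs: bold text is treated as
    Mambai, everything else (False or None) as English.
    """
    mambai = "".join(text for text, is_bold in runs if is_bold)
    english = "".join(text for text, is_bold in runs if not is_bold)
    return mambai, english


# One paragraph made of a bold run, a regular run, and an inherited-format run.
runs = [
    ("Karreta klao tel.", True),
    (" The car has", False),
    (" broken down.", None),
]
mambai, english = split_runs(runs)
```

Here `mambai` holds the bold text and `english` the concatenation of the two non-bold runs.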


### 2. Sentencize and tokenize text

nltk splits the text into sentences and then into tokens, since hunalign expects tokenized input with one sentence per line.
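
As a small illustration of that input format, the sketch below turns already-tokenized sentences into the lines that end up in the `.sent` files: tokens joined by single spaces, one sentence per line, with the `<p>` marker on a line of its own. `to_hunalign_lines` is a hypothetical helper name, and the token lists are hand-written here rather than produced by nltk.

```python
def to_hunalign_lines(tokenized_sentences):
    """Join each token list with spaces, giving one output line per sentence."""
    return [" ".join(tokens) for tokens in tokenized_sentences]


# Hand-tokenized example; the notebook gets these lists from word_tokenize.
example = [["<p>"], ["Good", "morning", "."], ["How", "are", "you", "?"]]
lines = to_hunalign_lines(example)
```

Writing `"\n".join(lines) + "\n"` to a file then yields the same layout the notebook produces.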


In [46]:
# nltk: sentencize
import nltk
from nltk.tokenize import sent_tokenize

# sent_tokenize needs the punkt model
nltk.download("punkt")


def break_down_sentences(items):
    new_items = []
    for item in items:
        item = item.replace("\n", " ").strip()
        if "<p>" in item:
            new_items.append(item)
        else:
            new_items.extend(sent_tokenize(item))
    return new_items


mambai_sentences = break_down_sentences(mambai_sentences)
english_sentences = break_down_sentences(english_sentences)
print(f"Total of {len(mambai_sentences)} sentences in mambai_sentences.")
print(f"Total of {len(english_sentences)} sentences in english_sentences.")

Total of 1279 sentences in mambai_sentences.
Total of 1273 sentences in english_sentences.


In [48]:
# nltk: tokenize
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")


def tokenize_sentences(sentences):
    tokenized_sentences = []
    for sentence in sentences:
        if "<p>" in sentence:
            tokenized_sentences.append(["<p>"])
        else:
            tokenized_sentences.append(word_tokenize(sentence))
    return tokenized_sentences


mambai_tokenized = tokenize_sentences(mambai_sentences)
english_tokenized = tokenize_sentences(english_sentences)
print(english_sentences)

print(
    f"Total of {len(mambai_tokenized)} tokenized Mambai sentences and {len(english_tokenized)} tokenized English sentences."
)

# Write one space-joined tokenized sentence per line to eng.sent and mgm.sent
with open("eng.sent", "w", encoding="utf-8") as f:
    for sentence in english_tokenized:
        f.write(" ".join(sentence) + "\n")

with open("mgm.sent", "w", encoding="utf-8") as f:
    for sentence in mambai_tokenized:
        f.write(" ".join(sentence) + "\n")

['<p>', 'Good morning.', 'Good afternoon.', 'Good night.', 'How are you?', 'Well, thanks.', 'And how are you?', 'Goodbye.', 'See you tomorrow.', "That's good!", 'Congratulations!', 'Get well soon!', 'Please.', 'Thank you.', 'Thank you very much.', "Don't mention it.", 'Keep up your spirits!', 'There is/are.', "There isn't/are not", "Excuse me (= I'm sorry Excuse me (for a moment) Please excuse me.", "I'm sorry (to have to do this) Do you mind?", 'Is it all right?', 'May I go?', 'Please give me/us...', 'Please also bring me/us....', 'Please help me!', 'Please hurry!', "It's important.", "It's urgent.", "I can't.", 'Is it possible?', "It's not possible.", "I'm (very) tired.", "I'm hungry.", "I'm thirsty.", 'I am (very) sick.', "(man's reply); (woman's reply)", 'Take me to a doctor.', 'Take me to a hospital.', 'Where is the...?', 'Here you are.', "I don't know.", "I don't understand.", 'Do you understand?', 'Do you see what I mean?', "Don't worry.", "I don't want this.", "I don't like thi

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/raphaelmerx/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 3. hunalign: find parallel sentences

Relies on the `mgm-eng.stem.dic` dictionary produced by the `extract_mambai_dict.ipynb` notebook.


In [49]:
HUNALIGN_BIN = "../hunalign/src/hunalign/hunalign"

!{HUNALIGN_BIN} mgm-eng.stem.dic mgm.sent eng.sent > mgm-eng.aligned

Reading dictionary...
1280 source language sentences read.
1274 target language sentences read.
quasiglobal_stopwordRemoval is set to 0
Simplified dictionary ready.
Rough translation ready.
0 
100 200 300 400 500 600 700 800 900 1000 1100 1200 
Rough translation-based similarity matrix ready.
Matrix built.
Trail found.
Align ready.
Global quality of unfiltered align 0.989281
quasiglobal_spaceOutBySentenceLength is set to 1
Trail spaced out by sentence length.
Global quality of unfiltered align after realign 0.989281
Quality 0.989281
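
Outside a notebook, the same invocation could be made with `subprocess` instead of the `!` shell escape. This is a sketch under the assumption that the positional arguments (dictionary, source file, target file) are exactly those used in the cell above; `build_hunalign_command` and `run_hunalign` are hypothetical helper names.

```python
import subprocess


def build_hunalign_command(hunalign_bin, dictionary, source_file, target_file):
    """Return the argument list: binary, dictionary, source file, target file."""
    return [hunalign_bin, dictionary, source_file, target_file]


def run_hunalign(hunalign_bin, dictionary, source_file, target_file, output_file):
    """Run hunalign and redirect its standard output to output_file."""
    command = build_hunalign_command(
        hunalign_bin, dictionary, source_file, target_file
    )
    with open(output_file, "w", encoding="utf-8") as out:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, stdout=out, check=True)
```

For this notebook the call would be `run_hunalign(HUNALIGN_BIN, "mgm-eng.stem.dic", "mgm.sent", "eng.sent", "mgm-eng.aligned")`.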


In [58]:
!python ladder2text.py mgm-eng.aligned mgm.sent eng.sent > aligned_text
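
To make the conversion step less opaque, here is a rough sketch of what a ladder-to-text conversion might do. It is not the `ladder2text.py` script itself. It assumes each ladder line holds three whitespace-separated columns (source index, target index, score), that consecutive lines delimit one aligned segment, and that sentences merged into one segment are joined with ` ~~~ `, which matches the separator the next cell strips out. `ladder_to_text` is a hypothetical name.

```python
def ladder_to_text(ladder_lines, source_sentences, target_sentences):
    """Turn ladder rungs into 'score<TAB>source<TAB>target' rows (sketch)."""
    rungs = []
    for line in ladder_lines:
        src_index, tgt_index, score = line.split()
        # Keep the score as a string so it is written back unchanged.
        rungs.append((int(src_index), int(tgt_index), score))

    rows = []
    # Each pair of consecutive rungs bounds one aligned segment.
    for (s0, t0, score), (s1, t1, _) in zip(rungs, rungs[1:]):
        source_text = " ~~~ ".join(source_sentences[s0:s1])
        target_text = " ~~~ ".join(target_sentences[t0:t1])
        rows.append(f"{score}\t{source_text}\t{target_text}")
    return rows


# Toy data: the first source sentence aligns with two target sentences.
ladder = ["0\t0\t0.5", "1\t2\t0.3", "2\t3\t0"]
source = ["a .", "b ."]
target = ["A .", "A2 .", "B ."]
rows = ladder_to_text(ladder, source, target)
```

The last rung only closes the final segment, so three rungs produce two rows.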

In [1]:
import nltk
import random
from nltk.tokenize.treebank import TreebankWordDetokenizer

THRESHOLD = 0.2

# how many lines in aligned_text?
with open("aligned_text", "r", encoding="utf-8") as f:
    print(f"Total of {len(f.readlines())} lines in aligned_text.")


# Read aligned_text (tab-separated: score, Mambai, English) and keep pairs scoring above THRESHOLD
def get_parallel_sentences(aligned_text):
    parallel_sentences = []
    with open(aligned_text, "r", encoding="utf-8") as f:
        for line in f:
            score, mgm, eng = line.split("\t")
            if float(score) > THRESHOLD:
                # " ~~~ " separates sentences that were merged into one aligned segment
                mgm = mgm.replace(" ~~~ ", " ").strip()
                eng = eng.replace(" ~~~ ", " ").strip()
                parallel_sentences.append((mgm, eng))
    return parallel_sentences


parallel_sentences = get_parallel_sentences("aligned_text")

print(f"Total of {len(parallel_sentences)} parallel sentences found.")

detokenizer = TreebankWordDetokenizer()
for i, (mgm, eng) in enumerate(parallel_sentences):
    parallel_sentences[i] = (
        detokenizer.detokenize(mgm.split()),
        detokenizer.detokenize(eng.split()),
    )

random.sample(parallel_sentences, 10)

Total of 1275 lines in aligned_text.
Total of 1187 parallel sentences found.


[('Deskulpa, Senór!', 'We are very sorry, Sir!'),
 ('It gosta kuartu mensapa ni, Senór (Senora)?',
  'What kind of room do you want, Sir?'),
 ('Karreta klao tel.', 'The car has broken down.'),
 ('Au gala ....', 'My name is ..'),
 ('Prepara meza kid la am têl.', 'We want a table for three.'),
 ('Balikan dlai.', "Don't move!"),
 ('Rom foer tel sôp.', 'They are dirty.'),
 ('Au hakarak aprende gase It tero.', 'I would like to learn your language.'),
 ('Au hakarak lao Otél X.', 'I want to go to the X.'),
 ('Lelbán pil ma rai?', 'How long will it take to arrive?')]

### 4. Save parallel corpus


In [59]:
import csv

with open("mambai_language_manual_parallel.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["Mambai", "English"])
    for mgm, eng in parallel_sentences:
        writer.writerow([mgm, eng])
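
A quick way to sanity-check the saved file is to read it back with the same `csv` settings. The sketch below does the round trip in memory with `io.StringIO` so that it runs without the corpus. `read_parallel_tsv` is a hypothetical helper, and the single pair is taken from the sample output above.

```python
import csv
import io


def read_parallel_tsv(handle):
    """Read a tab-separated corpus; return (header, list of sentence pairs)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    return header, [tuple(row) for row in reader]


# Write one pair the same way the notebook does, then read it back.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["Mambai", "English"])
writer.writerow(["Balikan dlai.", "Don't move!"])
buffer.seek(0)
header, pairs = read_parallel_tsv(buffer)
```

For the real file, pass a handle opened with `newline=""` and `encoding="utf-8"` instead of the in-memory buffer.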