## Bertalign

`https://github.com/bfsujason/bertalign`

Bertalign is a sentence alignment module for creating parallel corpora and translation memories. It uses multilingual sentence transformer models to align bilingual sentences and documents. Refer to the following paper for details:

`https://academic.oup.com/dsh/article-abstract/38/2/621/6965034`

I forked the repository above and modified it to improve sentence tokenization for Chinese. The fork is available at:

`https://github.com/ruben-tsui/bertalign`

## User inputs: input file names, source and target languages, `max_align` parameter

In [None]:
#### USER INPUTS ####
# (1) Source and target input files must be named <base>.<lang>.txt, e.g.
#     sample.en.txt
#     sample.zh.txt
#     Enter the base file name (the path without language code and extension):
base = '/content/MonteCristoTome1'

# (2) Source and target languages (check ISO 639-1 for language codes)
src_lang = 'zh'   # source language, e.g. en, zh, es, fr, de, ja, it
tgt_lang = 'fr'   # target language

# (3) Consider all "N sentences-to-M sentences" alignments with N + M <= max_align.
#     The larger this value, the longer the alignment takes to complete.
max_align = 9

#### DO NOT CHANGE THE FOLLOWING ####
# input files (source and target texts)
fin_src  = f'{base}.{src_lang}.txt'
fin_tgt  = f'{base}.{tgt_lang}.txt'
# output (aligned) file (plain text and Excel)
fon_txt  = f'{base}.bertalign.n{max_align}.{src_lang}-{tgt_lang}.txt'
fon_xlsx = f'{base}.bertalign.n{max_align}.{src_lang}-{tgt_lang}.xlsx'
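
To give a rough feel for why a larger `max_align` slows things down, the sketch below enumerates the N-to-M combinations that satisfy N + M <= `max_align`. `alignment_types` is a hypothetical helper written for this illustration, not a Bertalign function, and it only counts types with N, M >= 1; the library may consider additional types (for example insertions and deletions).

```python
def alignment_types(max_align):
    """Return every (n_src, n_tgt) pair with n_src, n_tgt >= 1 and n_src + n_tgt <= max_align.

    Hypothetical helper for illustration only; not part of Bertalign.
    """
    return [
        (n, m)
        for n in range(1, max_align)
        for m in range(1, max_align)
        if n + m <= max_align
    ]

# Print how many N-to-M combinations exist for a few values of max_align.
for k in (2, 5, 9):
    print(k, len(alignment_types(k)))
```

The number of combinations grows quadratically: 10 for `max_align = 5` and 36 for `max_align = 9`.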

In [None]:
!rm -rf bertalign/

In [None]:
!git clone https://github.com/ruben-tsui/bertalign

Cloning into 'bertalign'...
remote: Enumerating objects: 1174, done.
remote: Counting objects: 100% (49/49), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 1174 (delta 37), reused 29 (delta 29), pack-reused 1125 (from 2)
Receiving objects: 100% (1174/1174), 303.21 MiB | 27.86 MiB/s, done.
Resolving deltas: 100% (434/434), done.


In [None]:
cd /content/bertalign

/content/bertalign


In [None]:
%%capture
!pip install -r requirements.txt

In [None]:
from bertalign import Bertalign

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




In [None]:
cd ..

/content


## For subsequent alignments within the same Colab session, edit and re-run the first cell (input file names and other parameters), then select the following cell and choose `Runtime -> Run after`.

In [None]:
print(f"Reading source text: {fin_src}...")
with open(fin_src, 'r', encoding='utf-8') as f:
    src = f.read()
print(f"Reading target text: {fin_tgt}...")
with open(fin_tgt, 'r', encoding='utf-8') as f:
    tgt = f.read()

Reading source text: /content/MonteCristoTome1.zh.txt...
Reading target text: /content/MonteCristoTome1.fr.txt...
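
Because the file names are derived from `base` and the two language codes, a misnamed upload only surfaces as a `FileNotFoundError` when the cell above runs. A minimal sketch of a pre-check, assuming the `<base>.<lang>.txt` convention from the first cell; `missing_inputs` is a hypothetical helper, not part of Bertalign:

```python
from pathlib import Path

def missing_inputs(base, src_lang, tgt_lang):
    """Return the expected input paths that do not exist on disk.

    Hypothetical helper; assumes files are named <base>.<lang>.txt.
    """
    expected = [Path(f"{base}.{lang}.txt") for lang in (src_lang, tgt_lang)]
    return [p for p in expected if not p.is_file()]

# Example call with the values used in this notebook; an empty list means both files exist.
print(missing_inputs('/content/MonteCristoTome1', 'zh', 'fr'))
```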


### Sentence aligning begins below

In [None]:
%%time
fix_src_lines = False  # if True, the next cell calls align_sents_1toN() instead of align_sents()
aligner = Bertalign(src, tgt, max_align=max_align, src_lang=src_lang, tgt_lang=tgt_lang, fix_src_lines=fix_src_lines)


Source language: zh, Number of sentences: 7339
Target language: fr, Number of sentences: 5857
Embedding source and target text using LaBSE ...
CPU times: user 14min 26s, sys: 3.04 s, total: 14min 29s
Wall time: 14min 12s


In [None]:
%%time
if fix_src_lines:
    print(f"[0] fix_src_lines = {fix_src_lines}")
    aligner.align_sents_1toN()
else:
    print(f"[1] fix_src_lines = {fix_src_lines}")
    aligner.align_sents()

[1] fix_src_lines = False
Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligning 7339 zh sentences to 5857 fr sentences

CPU times: user 5.59 s, sys: 509 ms, total: 6.1 s
Wall time: 7.44 s
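
Before exporting, it can help to see how the alignments are distributed across shapes (1-1, 2-1, 1-2, and so on). The sketch below assumes, as the export cell in this notebook does, that `aligner.result` is a list of `(source_indices, target_indices)` pairs; `sample_result` is made-up data so the block runs on its own.

```python
from collections import Counter

def alignment_shape_counts(result):
    """Count alignments by shape, e.g. (2, 1) for two source sentences aligned to one target."""
    return Counter((len(s), len(t)) for s, t in result)

# Made-up stand-in for aligner.result (assumption: pairs of index lists).
sample_result = [([0], [0]), ([1, 2], [1]), ([3], [2, 3]), ([4], [4])]

for (n, m), count in sorted(alignment_shape_counts(sample_result).items()):
    print(f"{n}-{m}: {count}")
```

In the notebook, passing `aligner.result` instead of `sample_result` would give the tally for the actual alignment.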


In [None]:
from bertalign import model
from bertalign.utils import *
import numpy as np

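Write the sentence-aligned text to a plain-text (.txt) file, one alignment per line, with tab-separated columns: cosine distance, source sentence indices, source text, target sentence indices, target text.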

In [None]:
data = []
# Chinese and Japanese sentences are joined without a space; other languages use one space.
no_space_langs = ('zh', 'ja')
s_delim = '' if src_lang in no_space_langs else ' '
t_delim = '' if tgt_lang in no_space_langs else ' '
with open(fon_txt, 'w', encoding='utf-8', newline='\n') as fo:
    cnt = 0  # number of aligned pairs written
    for s, t in aligner.result:
        cnt += 1
        # source side: join the aligned sentences and embed the joined text
        ss = s_delim.join(aligner.src_sents[sidx].strip() for sidx in s)
        sv = model.model.encode(ss, normalize_embeddings=True)
        # target side
        tt = t_delim.join(aligner.tgt_sents[tidx].strip() for tidx in t)
        tv = model.model.encode(tt, normalize_embeddings=True)
        # embeddings are unit-normalized, so the dot product is the cosine similarity
        cossim = np.dot(sv, tv)
        cosdist = 1 - cossim
        line = f"{cosdist:.4f}\t{s}\t{ss}\t{t}\t{tt}\n"
        data.append(line)
        fo.write(line)
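
The first column of each output line is a cosine distance. Since the embeddings are requested with `normalize_embeddings=True`, a plain dot product already equals the cosine similarity, and the distance is one minus that value. A small pure-Python sketch of the same arithmetic on toy vectors (the vectors are illustrative, not real LaBSE embeddings):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    """1 - cosine similarity, computed as the dot product of the unit-normalized vectors."""
    u, v = normalize(u), normalize(v)
    return 1.0 - sum(a * b for a, b in zip(u, v))

print(f"{cosine_distance([1.0, 0.0], [1.0, 0.0]):.4f}")  # same direction
print(f"{cosine_distance([1.0, 0.0], [0.0, 1.0]):.4f}")  # orthogonal
```

The value ranges from 0 (same direction) to 2 (opposite direction), so smaller numbers in the first column indicate closer semantic matches.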

Creates the sentence-aligned text in .xlsx (Microsoft Excel) format

In [None]:
import pandas as pd
import openpyxl
from openpyxl.styles import Alignment
from openpyxl.utils.dataframe import dataframe_to_rows

# Create a new workbook
wb = openpyxl.Workbook()
# Select the active sheet
ws = wb.active
# Set column widths
ws.column_dimensions['A'].width = 10
ws.column_dimensions['B'].width = 10
ws.column_dimensions['C'].width = 10
ws.column_dimensions['D'].width = 50
ws.column_dimensions['E'].width = 10
ws.column_dimensions['F'].width = 65

# Strip the trailing newline so it does not end up inside the last column's cells
df = pd.DataFrame([x.rstrip('\n').split('\t') for x in data], columns=['cosdist', 'cols_s', src_lang, 'cols_t', tgt_lang])

for r in dataframe_to_rows(df, index=True, header=True):
    ws.append(r)

# Set cell alignment
alignment = Alignment(horizontal='general',
                      vertical='top',
                      wrap_text=True)

for row in ws[f'A1:F{cnt+10}']:
    for cell in row:
        cell.alignment = alignment

# Save the workbook
wb.save(fon_xlsx)
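
Once the files are written, pairs with a high cosine distance are the natural candidates for manual review. The following is a minimal sketch that reads the five-column tab-separated file produced earlier and returns the rows above a cut-off, worst first. `flag_weak_pairs` and the default threshold of 0.5 are assumptions chosen for illustration, not values recommended by Bertalign.

```python
import csv

def flag_weak_pairs(tsv_path, threshold=0.5):
    """Return rows of the aligned TSV whose cosine distance exceeds `threshold`, worst first.

    Hypothetical helper; the threshold is an arbitrary illustrative value.
    """
    flagged = []
    with open(tsv_path, encoding='utf-8', newline='') as f:
        # QUOTE_NONE keeps quotation marks in the sentences as literal text
        for row in csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
            if len(row) != 5:  # skip lines that do not have the expected five columns
                continue
            cosdist, src_idx, src_text, tgt_idx, tgt_text = row
            if float(cosdist) > threshold:
                flagged.append((float(cosdist), src_idx, src_text, tgt_idx, tgt_text))
    return sorted(flagged, reverse=True)
```

In the notebook this could be called as `flag_weak_pairs(fon_txt)` after the plain-text file has been written.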
