# Chinese-Vietnamese Sentence Alignment using Bertalign + SBERT

## Features:
- Uses LaBSE (Language-agnostic BERT Sentence Embedding) for multilingual sentence embeddings
- Performs sentence segmentation for both Chinese and Vietnamese
- Aligns sentences using Bertalign

## 0. Setup and Import

In [1]:
import os

if not os.path.exists('bertalign'):
    !git clone https://github.com/bfsujason/bertalign.git

os.chdir('bertalign')

!pip install faiss-cpu

if os.path.exists('requirements.txt'):
    !pip install -r requirements.txt

!pip install -e .

os.chdir('..')

print("Installation complete. Please RESTART your runtime/kernel!")

ERROR: Could not find a version that satisfies the requirement faiss-gpu==1.7.2 (from versions: none)
ERROR: No matching distribution found for faiss-gpu==1.7.2
Obtaining file:///content/bertalign
  Preparing metadata (setup.py) ... done
Installing collected packages: Bertalign
  Attempting uninstall: Bertalign
    Found existing installation: Bertalign 0.1.0
    Uninstalling Bertalign-0.1.0:
      Successfully uninstalled Bertalign-0.1.0
  Running setup.py develop for Bertalign

In [9]:
import sys
import re
import json
import os
from typing import List, Dict
from pathlib import Path

# Get the notebook's directory and set it as working directory
# This ensures relative paths work correctly regardless of where kernel started
NOTEBOOK_DIR = Path(os.getcwd()).resolve()

# If we're inside 'bertalign' folder from installation, go back up
if NOTEBOOK_DIR.name == 'bertalign' and (NOTEBOOK_DIR.parent / 'bertalign_sbert_notebook.ipynb').exists():
    NOTEBOOK_DIR = NOTEBOOK_DIR.parent
    os.chdir(NOTEBOOK_DIR)

print(f"Working directory: {NOTEBOOK_DIR}")

# Add local bertalign package to path
sys.path.insert(0, str(NOTEBOOK_DIR / 'bertalign'))

Working directory: /content


## 1. Create Data for Training

In [None]:
def load_and_filter_json(input_json_path, start_id, end_id, rename_cn_to_zh=False, source_name=None):
    """
    Loads and filters a JSON file by ID range.
    
    Args:
        input_json_path (str): Path to the source JSON file.
        start_id (int): The starting ID of the range (inclusive).
        end_id (int): The ending ID of the range (inclusive).
        rename_cn_to_zh (bool): If True, rename 'cn' key to 'zh'.
        source_name (str): Name of the source to add to each item.
    
    Returns:
        list: Filtered data items.
    """
    print(f"Loading '{input_json_path}' for IDs between {start_id} and {end_id}...")
    
    with open(input_json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    if not isinstance(data, list):
        print("Error: The root of the JSON file is not a list.")
        return []

    # Filter the data based on the id range
    filtered_data = [
        item for item in data
        if isinstance(item, dict) and 'id' in item and start_id <= item.get('id', -1) <= end_id
    ]
    
    # Rename 'cn' to 'zh' if needed
    if rename_cn_to_zh:
        for item in filtered_data:
            if 'cn' in item:
                item['zh'] = item.pop('cn')
    
    # Add source name if provided
    if source_name:
        for item in filtered_data:
            item['source'] = source_name
    
    print(f"  -> Loaded {len(filtered_data)} items")
    return filtered_data


def combine_json_subsets(output_path, *sources):
    """
    Combines multiple JSON sources into a single file.
    
    Args:
        output_path (str): Path to save the combined JSON file.
        *sources: Tuples of (input_path, start_id, end_id, rename_cn_to_zh, source_name)
    """
    combined_data = []
    
    for input_path, start_id, end_id, rename_cn_to_zh, source_name in sources:
        items = load_and_filter_json(input_path, start_id, end_id, rename_cn_to_zh, source_name)
        combined_data.extend(items)
    
    # Write the combined data (keep original IDs)
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(combined_data, f, ensure_ascii=False, indent=4)
    
    print(f"\nSuccessfully created '{output_path}' with {len(combined_data)} total items.")


In [None]:
if not os.path.exists("data.json"):
    combine_json_subsets(
        "data.json",
        # (input_file, start_id, end_id, rename_cn_to_zh, source_name)
        ("json1.json", 1, 212, True, "json1"),   # json1 uses 'cn' -> rename to 'zh'
        ("json2.json", 1, 1163, False, "json2"), # json2 already uses 'zh'
    )
else:
    print("data.json already exists, skipping creation.")
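
The loader above implies that each input file is a JSON list of objects with an integer `id`, a `cn` or `zh` text field, and a `vi` text field. That layout is inferred from the filtering code, not from any documented schema, and the records below are made up. A minimal sketch of the same range filter and key rename on in-memory data, without file I/O:

```python
# Hypothetical records mimicking the assumed layout of json1.json.
sample = [
    {"id": 1, "cn": "你好。", "vi": "Xin chào."},
    {"id": 2, "cn": "谢谢。", "vi": "Cảm ơn."},
    {"id": 3, "cn": "再见。", "vi": "Tạm biệt."},
    "not a dict",                              # malformed entry, skipped
    {"cn": "无编号", "vi": "Không có id"},      # no 'id' key, skipped
]

def filter_by_id(items, start_id, end_id, rename_cn_to_zh=False, source_name=None):
    """Apply the same rules as load_and_filter_json to an in-memory list."""
    kept = [
        dict(item)  # shallow copy so the input records are left untouched
        for item in items
        if isinstance(item, dict) and "id" in item and start_id <= item["id"] <= end_id
    ]
    for item in kept:
        if rename_cn_to_zh and "cn" in item:
            item["zh"] = item.pop("cn")
        if source_name:
            item["source"] = source_name
    return kept

subset = filter_by_id(sample, 1, 2, rename_cn_to_zh=True, source_name="json1")
for row in subset:
    print(row)
```

Both ends of the ID range are inclusive, so the call keeps the records with IDs 1 and 2 and drops the rest.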

## 2. Patch Bertalign for Vietnamese Support

Bertalign does not support Vietnamese out of the box, so two of its helper functions need to be patched:
- Language detection (to identify Vietnamese text)
- Sentence splitting (Vietnamese uses sentence-final punctuation similar to English, so a punctuation-based rule is used)
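
As a rough illustration of the second point, a punctuation-based splitter can be tried on a couple of sample strings. The function name `split_vi` and the sample sentences are only for this sketch; the rule itself is the same regular expression the patch uses.

```python
import re

def split_vi(text):
    """Split after '.', '?' or '!' when followed by whitespace (a deliberately naive rule)."""
    marked = re.sub(r'([.?!])\s+', r'\1\n', text)
    return [s.strip() for s in marked.split('\n') if s.strip()]

plain = split_vi("Tôi đi học. Bạn khỏe không? Tốt lắm!")
print(plain)

# Abbreviations ending in a period are split as well, a known weakness of this rule.
abbrev = split_vi("TP. Hồ Chí Minh rất đông. Tôi thích nơi này.")
print(abbrev)
```

The second call shows the trade-off: `TP.` becomes a sentence of its own, so texts with many abbreviations may need an exception list or a dedicated Vietnamese tokenizer.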

In [16]:
# Import bertalign modules
import re   # used by patched_split_sents below
import sys
sys.path.insert(0, 'bertalign')  # Add the repo root to path

# Now import the submodules directly
import bertalign.utils as bertalign_utils
import bertalign.aligner as bertalign_aligner

def patched_detect_lang(text):
    """
    Simple language detection based on character ranges.
    Returns 'zh' for Chinese, 'vi' for Vietnamese.
    """
    # Count Chinese characters
    chinese_chars = sum(1 for c in text if '\u4e00' <= c <= '\u9fff')
    # Count Vietnamese diacritics
    vietnamese_chars = sum(1 for c in text if c in 'àáạảãăắằặẳẵâấầậẩẫèéẹẻẽêếềệểễìíịỉĩòóọỏõôốồộổỗơớờợởỡùúụủũưứừựửữỳýỵỷỹđÀÁẠẢÃĂẮẰẶẲẴÂẤẦẬẨẪÈÉẸẺẼÊẾỀỆỂỄÌÍỊỈĨÒÓỌỎÕÔỐỒỘỔỖƠỚỜỢỞỠÙÚỤỦŨƯỨỪỰỬỮỲÝỴỶỸĐ')
    
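    # Treat the text as Chinese when more than 10% of its characters are CJK ideographs
    if chinese_chars > len(text) * 0.1:
        return 'zh'
    # Otherwise any Vietnamese diacritic counts as evidence of Vietnamese
    elif vietnamese_chars > 0:
        return 'vi'
    else:
        # No diacritics at all: 'zh' if any CJK character is present, else default to 'vi'
        return 'zh' if chinese_chars > 0 else 'vi'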

def patched_split_sents(text, lang):
    """
    Split text into sentences. Adds Vietnamese support.
    """
    if lang == 'zh':
        # Use the original Chinese splitter
        return bertalign_utils._split_zh(text)
    elif lang == 'vi':
        # Vietnamese sentence splitting using punctuation
        text = re.sub(r'([.?!])\s+', r'\1\n', text)
        sents = [s.strip() for s in text.split('\n') if s.strip()]
        return sents
    elif lang in bertalign_utils.LANG.SPLITTER:
        from sentence_splitter import SentenceSplitter
        splitter = SentenceSplitter(language=lang)
        sents = splitter.split(text=text)
        sents = [sent.strip() for sent in sents]
        return sents
    else:
        raise Exception(f'The language {lang} is not supported yet.')

# Add Vietnamese to LANG.SPLITTER and LANG.ISO
bertalign_utils.LANG.SPLITTER['vi'] = 'Vietnamese'
bertalign_utils.LANG.ISO['vi'] = 'Vietnamese'

# Apply the patches
bertalign_utils.detect_lang = patched_detect_lang
bertalign_utils.split_sents = patched_split_sents
bertalign_aligner.detect_lang = patched_detect_lang
bertalign_aligner.split_sents = patched_split_sents

print("Patches applied")

Patches applied
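
The patch is assigned on both `bertalign.utils` and `bertalign.aligner`. This matters if, as assumed here, the aligner module pulls the helpers in with a `from ... import` statement. Such an import copies the name binding at import time, so replacing the function on `utils` alone would not be seen by the aligner. The sketch below reproduces the effect with two throwaway modules; `fake_utils` and `fake_aligner` are hypothetical stand-ins, not part of Bertalign.

```python
import types

# Stand-in for bertalign.utils
fake_utils = types.ModuleType("fake_utils")
exec("def detect_lang(text):\n    return 'original'", fake_utils.__dict__)

# Stand-in for bertalign.aligner: copies the name, as a from-import would
fake_aligner = types.ModuleType("fake_aligner")
fake_aligner.detect_lang = fake_utils.detect_lang
exec("def run(text):\n    return detect_lang(text)", fake_aligner.__dict__)

def patched(text):
    return "patched"

# Patching only the utils module leaves the aligner's copy untouched
fake_utils.detect_lang = patched
before = fake_aligner.run("xin chào")
print(before)

# Patching the aligner's own namespace is what changes its behaviour
fake_aligner.detect_lang = patched
after = fake_aligner.run("xin chào")
print(after)
```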


## 3. Import Bertalign

Now we can import the patched Bertalign. It uses LaBSE for sentence embeddings by default.

In [4]:
from bertalign import Bertalign

print("Bertalign imported successfully!")

Bertalign imported successfully!
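
To give an intuition for embedding-based matching in general, the toy example below picks the closest Vietnamese sentence for each Chinese sentence by cosine similarity. It is not Bertalign's code: the three-dimensional vectors are invented by hand, whereas real LaBSE embeddings have 768 dimensions and come from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy "embeddings" (hypothetical values, not model output)
zh_vecs = {"我喜欢猫。": [0.9, 0.1, 0.0], "今天下雨。": [0.0, 0.2, 0.9]}
vi_vecs = {"Hôm nay trời mưa.": [0.1, 0.1, 0.95], "Tôi thích mèo.": [0.85, 0.2, 0.05]}

best = {}
for zh, zv in zh_vecs.items():
    # Nearest target sentence under cosine similarity
    best[zh] = max(vi_vecs, key=lambda vi: cosine(zv, vi_vecs[vi]))
    print(zh, "->", best[zh])
```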


## 4. Define Alignment Function

In [11]:
from typing import Tuple

def align_zh_vi(zh_text: str, vi_text: str,
                source: str = "unknown",
                source_id: int = 0,
                max_align: int = 5,
                top_k: int = 3,
                win: int = 5) -> Tuple[List[Dict], Bertalign]:
    """
    Aligns Chinese and Vietnamese text using Bertalign with LaBSE.
    
    Parameters:
    -----------
    zh_text : str
        Chinese text to align
    vi_text : str
        Vietnamese text to align
    source : str
        Source identifier for the text pair
    source_id : int
        ID of the source document
    max_align : int
        Maximum number of sentences to merge in alignment (default: 5)
    top_k : int
        Number of top candidates to consider (default: 3)
    win : int
        Window size for alignment search (default: 5)
        
    Returns:
    --------
    Tuple[List[Dict], Bertalign]: The aligned sentence pairs with metadata,
        and the Bertalign instance that produced them
    """
    
    # Initialize Bertalign
    aligner = Bertalign(
        src=zh_text,
        tgt=vi_text,
        max_align=max_align,
        top_k=top_k,
        win=win,
        skip=-0.1,
        margin=True,
        len_penalty=True,
        is_split=False,
    )
    
    # Perform alignment
    aligner.align_sents()
    
    # Process results
    results = []
    for src_indices, tgt_indices in aligner.result:
        if len(src_indices) > 0 and len(tgt_indices) > 0:
            zh_aligned = ' '.join(aligner.src_sents[src_indices[0]:src_indices[-1]+1])
            vi_aligned = ' '.join(aligner.tgt_sents[tgt_indices[0]:tgt_indices[-1]+1])
            
            results.append({
                'zh': zh_aligned,
                'vi': vi_aligned,
                'source': str(source),
                'source_id': int(source_id),  # Convert numpy int64 to Python int
                'align_type': f"{len(src_indices)}-{len(tgt_indices)}",
                'src_indices': [int(i) for i in src_indices],  # Convert numpy ints
                'tgt_indices': [int(i) for i in tgt_indices]   # Convert numpy ints
            })
    
    return results, aligner
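
The post-processing above treats `aligner.result` as a list of (source indices, target indices) pairs, where either side may hold several sentence indices or none. The sketch below walks through that step on hand-written data. The sentences and the `beads` list are invented for illustration, and no model is run.

```python
src_sents = ["我喜欢猫。", "它们很可爱。", "狗也不错。"]
tgt_sents = ["Tôi thích mèo, chúng rất dễ thương.", "Chó cũng được.", "Hết."]

# Hypothetical alignment beads: a 2-1 merge, a 1-1 match, and a target-only 0-1 bead
beads = [([0, 1], [0]), ([2], [1]), ([], [2])]

pairs = []
for src_idx, tgt_idx in beads:
    if not src_idx or not tgt_idx:
        continue  # one-sided beads have no translation partner and are dropped
    pairs.append({
        "zh": " ".join(src_sents[src_idx[0]:src_idx[-1] + 1]),
        "vi": " ".join(tgt_sents[tgt_idx[0]:tgt_idx[-1] + 1]),
        "align_type": f"{len(src_idx)}-{len(tgt_idx)}",
    })

for p in pairs:
    print(p["align_type"], "|", p["zh"], "|", p["vi"])
```

Note that dropping one-sided beads means untranslated sentences never reach the output corpus, which is usually what a parallel training set wants.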

## 5. Process Data from JSON File

In [14]:
def process_data(input_path: str, output_json: str):
    """
    Process a JSON file containing zh-vi text pairs and create aligned corpus.
    Saves results incrementally to JSON after each successful alignment.
    
    Parameters:
    -----------
    input_path : str
        Path to input JSON file
    output_json : str
        Path to output JSON file (results saved incrementally)
        
    Returns:
    --------
    List[Dict]: All aligned pairs
    """
    from tqdm import tqdm
    
    with open(input_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    
    print(f"Loaded {len(data)} documents")
    print(f"Output: {output_json}")
    
    all_pairs: List[Dict] = []
    
    for item in tqdm(data, desc="Aligning documents"):
        if 'zh' not in item or 'vi' not in item:
            continue
        
        zh_text = item['zh'].strip()
        vi_text = item['vi'].strip()
        
        if not zh_text or not vi_text:
            continue
        
        try:
            aligned, _ = align_zh_vi(
                zh_text, vi_text,
                source=item.get('source', 'unknown'),
                source_id=item.get('id', 0)
            )
            
            if aligned:
                all_pairs.extend(aligned)
                
                # Save incrementally after each successful alignment
                with open(output_json, 'w', encoding='utf-8') as f:
                    json.dump(all_pairs, f, ensure_ascii=False, indent=4)
                
        except Exception as e:
            print(f"\nError processing item {item.get('id', '?')}: {e}")
            continue
    
    print(f"\nTotal aligned pairs: {len(all_pairs)}")
    print(f"Saved to: {output_json}")
    
    return all_pairs
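
Rewriting the full list after every document keeps the output valid JSON at all times, at the cost of re-serialising a growing list on each step. One alternative, which is not what this notebook does, is to append one JSON object per line (JSON Lines). A minimal sketch, with `append_pairs`, `read_pairs`, and the temporary file path as hypothetical names:

```python
import json
import os
import tempfile

def append_pairs(path, pairs):
    """Append each pair as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

def read_pairs(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

out_path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")  # throwaway location
append_pairs(out_path, [{"zh": "你好。", "vi": "Xin chào.", "source_id": 1}])
append_pairs(out_path, [{"zh": "谢谢。", "vi": "Cảm ơn.", "source_id": 2}])

loaded = read_pairs(out_path)
print(len(loaded), "pairs read back")
```

The trade-off is that the result is no longer a single JSON array, so downstream code has to read it line by line.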

In [15]:
# Change 'subsubset.json' to 'data.json' for full dataset

all_pairs = process_data(
    str(NOTEBOOK_DIR / 'subsubset.json'), 
    output_json=str(NOTEBOOK_DIR / 'corpus_bertalign.json')
)


Bertalign Sentence Alignment
Loaded 5 documents
Output: /home/thienan/Documents/coding/zh-vn-mt/corpus_bertalign.json


Aligning documents:   0%|          | 0/5 [00:00<?, ?it/s]

Source language: Chinese, Number of sentences: 43
Target language: Vietnamese, Number of sentences: 54
Embedding source and target text using LaBSE ...


Aligning documents:  20%|██        | 1/5 [00:27<01:49, 27.31s/it]

Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligning 43 Chinese sentences to 54 Vietnamese sentences

Source language: Chinese, Number of sentences: 44
Target language: Vietnamese, Number of sentences: 50
Embedding source and target text using LaBSE ...


Aligning documents:  40%|████      | 2/5 [00:55<01:24, 28.02s/it]

Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligning 44 Chinese sentences to 50 Vietnamese sentences

Source language: Chinese, Number of sentences: 42
Target language: Vietnamese, Number of sentences: 46
Embedding source and target text using LaBSE ...


Aligning documents:  60%|██████    | 3/5 [01:21<00:53, 26.83s/it]

Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligning 42 Chinese sentences to 46 Vietnamese sentences

Source language: Chinese, Number of sentences: 36
Target language: Vietnamese, Number of sentences: 47
Embedding source and target text using LaBSE ...


Aligning documents:  80%|████████  | 4/5 [01:45<00:25, 25.98s/it]

Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligning 36 Chinese sentences to 47 Vietnamese sentences

Source language: Chinese, Number of sentences: 61
Target language: Vietnamese, Number of sentences: 68
Embedding source and target text using LaBSE ...


Aligning documents: 100%|██████████| 5/5 [02:17<00:00, 27.42s/it]

Performing first-step alignment ...
Performing second-step alignment ...
Finished! Successfully aligning 61 Chinese sentences to 68 Vietnamese sentences


Total aligned pairs: 207
Saved to: /home/thienan/Documents/coding/zh-vn-mt/corpus_bertalign.json
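
Once the corpus file exists, a quick tally of `align_type` gives a first impression of how often sentences were merged. The records below are placeholders in the same shape as the entries written by `process_data`; in practice they would be loaded from `corpus_bertalign.json` with `json.load`.

```python
from collections import Counter

# Placeholder records; real ones would come from corpus_bertalign.json
corpus = [
    {"zh": "...", "vi": "...", "align_type": "1-1"},
    {"zh": "...", "vi": "...", "align_type": "1-1"},
    {"zh": "...", "vi": "...", "align_type": "1-2"},
    {"zh": "...", "vi": "...", "align_type": "2-1"},
    {"zh": "...", "vi": "...", "align_type": "1-1"},
]

counts = Counter(rec["align_type"] for rec in corpus)
for align_type, n in counts.most_common():
    print(f"{align_type}: {n} ({n / len(corpus):.0%})")
```

A large share of many-to-many types can be a hint that sentence splitting on one side is too coarse or too fine.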



