# Fine-tuning MarianMT in both directions (EN↔VI) and evaluating BLEU1..BLEU4

MarianMT (OPUS-MT) models are **directional**: each of the checkpoints used here translates in one direction only. To translate **both ways, EN↔VI**, in the standard MarianMT fashion, we therefore **fine-tune two models**:

- **Model A:** `Helsinki-NLP/opus-mt-en-vi` (EN→VI)
- **Model B:** `Helsinki-NLP/opus-mt-vi-en` (VI→EN)

This notebook will:
1. Read the parallel data (train/test)
2. Fine-tune **both directions** (one after the other)
3. Evaluate **BLEU1, BLEU2, BLEU3, BLEU4** on the test set for each direction
4. Save the models to the `vlsp_marian_en-vi/` and `vlsp_marian_vi-en/` directories
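
Because each checkpoint is directional, the same parallel corpus is used twice with the source and target columns swapped. A minimal sketch of that bookkeeping follows; `CHECKPOINTS` and `pick_direction` are hypothetical names introduced for illustration and do not appear in the notebook.

```python
# Hypothetical helper: map a direction string to its checkpoint and column order.
CHECKPOINTS = {
    "en-vi": "Helsinki-NLP/opus-mt-en-vi",
    "vi-en": "Helsinki-NLP/opus-mt-vi-en",
}

def pick_direction(direction, en_lines, vi_lines):
    """Return (checkpoint, source_lines, target_lines) for one direction."""
    if direction not in CHECKPOINTS:
        raise ValueError(f"unknown direction: {direction!r}")
    if direction == "en-vi":
        src, tgt = en_lines, vi_lines
    else:
        src, tgt = vi_lines, en_lines
    return CHECKPOINTS[direction], src, tgt

name, src, tgt = pick_direction("vi-en", ["Hello."], ["Xin chào."])
print(name, src, tgt)
```

For `vi-en` the Vietnamese lines become the source and the English lines the target; nothing else about the data changes.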


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip -q install "transformers>=4.46.0" "datasets>=3.0.0" sacrebleu sentencepiece accelerate


In [None]:
import os
from pathlib import Path
import numpy as np
import torch
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
import sacrebleu
from sacrebleu.metrics import BLEU

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)


Device: cuda


In [None]:
# 1) DATA CONFIGURATION
DATA_DIR = Path("/content")

TRAIN_EN = DATA_DIR / "train.en.txt"
TRAIN_VI = DATA_DIR / "train.vi.txt"
TEST_EN  = DATA_DIR / "public_test.en.txt"
TEST_VI  = DATA_DIR / "public_test.vi.txt"

# None = use every training pair; set an integer to train on a prefix only
MAX_TRAIN_SAMPLES = None

# Fraction of the training pairs held out for validation, and the split seed
VALID_RATIO = 0.05
SPLIT_SEED = 42

# Maximum token lengths; longer sequences are truncated
MAX_SOURCE_LENGTH = 128
MAX_TARGET_LENGTH = 128

# Beam search settings used when generating translations for evaluation
GEN_NUM_BEAMS = 4
GEN_MAX_LENGTH = 128

# Training hyperparameters
BATCH_SIZE = 8
NUM_EPOCHS = 2
LR = 5e-5
WEIGHT_DECAY = 0.0

# Run both directions
DIRECTIONS = ["en-vi", "vi-en"]

# Check that the input files exist
for p in [TRAIN_EN, TRAIN_VI, TEST_EN, TEST_VI]:
    print(p, "exists =", p.exists())


/content/train.en.txt exists = True
/content/train.vi.txt exists = True
/content/public_test.en.txt exists = True
/content/public_test.vi.txt exists = True


In [None]:
# 2) LOAD THE PARALLEL CORPUS
from typing import Optional

def load_lines(path: Path, max_samples: Optional[int] = None):
    """Read a text file into a list of stripped lines (one sentence per line)."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    if max_samples is not None:
        lines = lines[:max_samples]
    return lines

def load_parallel(en_path: Path, vi_path: Path, max_samples: Optional[int] = None):
    """Load both sides; if the line counts differ, truncate to the shorter side."""
    en = load_lines(en_path, max_samples=max_samples)
    vi = load_lines(vi_path, max_samples=max_samples)
    if len(en) != len(vi):
        m = min(len(en), len(vi))
        print(f"Warning: line count mismatch: EN={len(en)} VI={len(vi)} -> truncating to {m}")
        en, vi = en[:m], vi[:m]
    return en, vi

train_en, train_vi = load_parallel(TRAIN_EN, TRAIN_VI, max_samples=MAX_TRAIN_SAMPLES)
test_en,  test_vi  = load_parallel(TEST_EN,  TEST_VI,  max_samples=None)

print("Train pairs:", len(train_en))
print("Test pairs :", len(test_en))

for i in range(3):
    print("-" * 80)
    print("EN :", train_en[i][:200])
    print("VI :", train_vi[i][:200])


Train pairs: 500000
Test pairs : 3000
--------------------------------------------------------------------------------
EN : To evaluate clinical, subclinical symptoms of patients with otitis media with effusion and V.a at otorhinolaryngology department – Thai Nguyen national hospital
VI : Nghiên cứu đặc điểm lâm sàng, cận lâm sàng bệnh nhân viêm tai ứ dịch trên viêm V.A tại Khoa Tai mũi họng - Bệnh viện Trung ương Thái Nguyên
--------------------------------------------------------------------------------
EN : Evaluate clinical, subclinical symptoms of patients with otittis media effusion and V a at otorhinolaryngology department - Thai Nguyên National Hospital.
VI : Đánh giá đặc điểm lâm sàng, cận lâm sàng bệnh nhân viêm tai ứ dịch trên viêm V.a tại Khoa Tai mũi họng - Bệnh viện Trung ương Thái Nguyên.
--------------------------------------------------------------------------------
EN : There was a relation between vasodilatation and vaginal dysfunction.
VI : Có sự liên quan giữa độ q
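
`load_lines` strips whitespace but keeps empty lines, which is what keeps the two files aligned line by line. One possible follow-up, sketched here on the assumption that empty pairs should be dropped rather than trained on, filters both lists together so the alignment is preserved. `drop_empty_pairs` is a hypothetical helper and is not part of the notebook.

```python
def drop_empty_pairs(src_lines, tgt_lines):
    """Remove pairs where either side is empty, keeping the two lists aligned."""
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines) if s.strip() and t.strip()]
    src = [s for s, _ in kept]
    tgt = [t for _, t in kept]
    return src, tgt

en = ["Hello.", "", "Thanks."]
vi = ["Xin chào.", "Dòng thừa.", ""]
clean_en, clean_vi = drop_empty_pairs(en, vi)
print(clean_en, clean_vi)  # only the first pair survives
```

Filtering the pairs jointly matters: removing empty lines from each file separately would shift one side against the other.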

In [None]:
# 3) BUILD THE DATASET FOR ONE DIRECTION
def make_dataset(src_list, tgt_list, test_src_list, test_tgt_list, valid_ratio=0.05, seed=42):
    raw_train = Dataset.from_dict({"src": src_list, "tgt": tgt_list})
    raw_test  = Dataset.from_dict({"src": test_src_list, "tgt": test_tgt_list})
    split = raw_train.train_test_split(test_size=valid_ratio, seed=seed)
    return DatasetDict({
        "train": split["train"],
        "validation": split["test"],
        "test": raw_test,
    })

# Example: build the dataset for en->vi
dataset_en_vi = make_dataset(train_en, train_vi, test_en, test_vi, valid_ratio=VALID_RATIO, seed=SPLIT_SEED)
print(dataset_en_vi)
print(dataset_en_vi["train"][0])


DatasetDict({
    train: Dataset({
        features: ['src', 'tgt'],
        num_rows: 475000
    })
    validation: Dataset({
        features: ['src', 'tgt'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['src', 'tgt'],
        num_rows: 3000
    })
})
{'src': 'Conclusions: Communication skills was taught to the whole of second-year students in Hanoi Medical University at the year 2013 - 2014 with 2 credits.', 'tgt': 'Kết luận: Môn học KNGT đã được giảng cho toàn bộ khối sinh viên Y2 tại trường Đại học Y Hà Nội từ năm học 2013-2014 với cấu trúc 1/1.'}
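
The split sizes above determine how long training runs. As a rough sketch, assuming a single device and no gradient accumulation (neither is set explicitly in this notebook), the number of optimizer steps follows from the 475,000 training rows, `BATCH_SIZE = 8`, and `NUM_EPOCHS = 2`. `optimizer_steps` is a hypothetical helper used only for this arithmetic.

```python
import math

def optimizer_steps(num_examples, batch_size, epochs, grad_accum=1, num_devices=1):
    """Estimate optimizer steps for a run (assumes the last partial batch is kept)."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch, steps_per_epoch * epochs

per_epoch, total = optimizer_steps(num_examples=475_000, batch_size=8, epochs=2)
print(per_epoch, total)  # 59375 118750
```

Under those assumptions each direction takes about 118,750 steps, so setting `MAX_TRAIN_SAMPLES` to a small number is a cheap way to smoke-test the pipeline first.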


In [None]:
# 4) BLEU1..BLEU4 COMPUTATION
def postprocess_text(preds, refs):
    """Strip surrounding whitespace from predictions and references."""
    preds = [p.strip() for p in preds]
    refs  = [r.strip() for r in refs]
    return preds, refs

def compute_bleu_1to4(preds, refs):
    """Corpus-level BLEU with maximum n-gram order 1..4.

    preds: list[str], refs: list[str] (one reference per prediction).
    """
    preds, refs = postprocess_text(preds, refs)
    out = {}
    for n in [1, 2, 3, 4]:
        # BLEU-n uses the n-gram precisions of orders 1..n
        bleu_metric = BLEU(max_ngram_order=n)
        score = bleu_metric.corpus_score(preds, [refs]).score
        out[f"bleu{n}"] = round(float(score), 4)
    return out

# Quick sanity check
print(compute_bleu_1to4(["hello world"], ["hello world"]))


{'bleu1': 100.0, 'bleu2': 100.0, 'bleu3': 0.0, 'bleu4': 0.0}
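
The sanity check scores 0.0 for BLEU3 and BLEU4 even though the prediction is identical to the reference, because a two-token sentence contains no 3-grams or 4-grams. The sketch below is a simplified, unsmoothed, single-sentence BLEU in pure Python. It is not sacrebleu's implementation (no tokenizer, no smoothing, one reference only), and `simple_bleu` is a hypothetical name; it only illustrates why a missing n-gram order drives the score to zero.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hyp, ref, max_n):
    """Unsmoothed BLEU-max_n for one whitespace-tokenized sentence pair."""
    hyp_tok, ref_tok = hyp.split(), ref.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp_tok, n), ngrams(ref_tok, n)
        total = sum(h.values())
        matched = sum(min(count, r[gram]) for gram, count in h.items())
        if total == 0 or matched == 0:
            return 0.0  # one empty precision zeroes the geometric mean
        log_sum += math.log(matched / total)
    # Brevity penalty: penalize hypotheses shorter than the reference
    if len(hyp_tok) >= len(ref_tok):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tok) / len(hyp_tok))
    return 100.0 * bp * math.exp(log_sum / max_n)

print(simple_bleu("hello world", "hello world", 2))  # 100.0
print(simple_bleu("hello world", "hello world", 3))  # 0.0
```

On the real test set this edge case rarely matters, since the scores are computed over the whole corpus and the sentences are much longer than two tokens.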


In [None]:
# 5) FINE-TUNE ONE DIRECTION (EN→VI or VI→EN)
def finetune_direction(direction: str, train_en, train_vi, test_en, test_vi):
    assert direction in ["en-vi", "vi-en"]

    if direction == "en-vi":
        model_name = "Helsinki-NLP/opus-mt-en-vi"
        src_train, tgt_train = train_en, train_vi
        src_test,  tgt_test  = test_en,  test_vi
    else:
        model_name = "Helsinki-NLP/opus-mt-vi-en"
        src_train, tgt_train = train_vi, train_en
        src_test,  tgt_test  = test_vi,  test_en

    print("\n" + "="*100)
    print("Direction:", direction)
    print("Base model:", model_name)
    print("="*100)

    ds = make_dataset(src_train, tgt_train, src_test, tgt_test, valid_ratio=VALID_RATIO, seed=SPLIT_SEED)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def preprocess_function(batch):
        model_inputs = tokenizer(
            batch["src"],
            max_length=MAX_SOURCE_LENGTH,
            truncation=True,
        )
        labels = tokenizer(
            text_target=batch["tgt"],
            max_length=MAX_TARGET_LENGTH,
            truncation=True,
        )
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized = ds.map(preprocess_function, batched=True, remove_columns=["src", "tgt"])

    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

    # Pass the model object (not its name) so the collator can build decoder_input_ids from the labels
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]

        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

        bleu_dict = compute_bleu_1to4(decoded_preds, decoded_labels)

        pred_lens = [np.count_nonzero(p != tokenizer.pad_token_id) for p in preds]
        bleu_dict["gen_len"] = float(np.mean(pred_lens))
        return bleu_dict

    output_dir = f"vlsp_marian_{direction}"

    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="steps",
        logging_steps=100,
        save_total_limit=2,

        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LR,
        weight_decay=WEIGHT_DECAY,

        predict_with_generate=True,
        generation_num_beams=GEN_NUM_BEAMS,
        generation_max_length=GEN_MAX_LENGTH,

        load_best_model_at_end=True,
        metric_for_best_model="bleu4",
        greater_is_better=True,

        fp16=torch.cuda.is_available(),
        report_to="none",
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        processing_class=tokenizer,  # replaces the deprecated tokenizer= argument (transformers >= 4.46)
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    train_result = trainer.train()
    trainer.save_model(output_dir)      # with load_best_model_at_end=True, this is the best checkpoint
    tokenizer.save_pretrained(output_dir)

    # Evaluate on the test set: predict() runs compute_metrics, which yields BLEU1..4
    pred_out = trainer.predict(tokenized["test"], metric_key_prefix="test")
    test_metrics = pred_out.metrics

    print("\nTest metrics:", {k: v for k, v in test_metrics.items() if "bleu" in k or "gen_len" in k})

    return {
        "direction": direction,
        "model_dir": output_dir,
        "base_model": model_name,
        "test_metrics": test_metrics,
    }
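
Inside `compute_metrics`, the labels are passed through `np.where(labels != -100, ...)` before decoding. The reason is that the collator pads label sequences with `-100` so the loss ignores those positions, and `-100` is not a valid token id for the tokenizer. The dependency-free sketch below shows that round trip on plain lists; `PAD_ID = 0` is a made-up value for illustration, not the actual Marian pad id, and both helper names are hypothetical.

```python
IGNORE_INDEX = -100   # label value that the loss function skips
PAD_ID = 0            # hypothetical pad token id, for illustration only

def pad_labels(batch, ignore_index=IGNORE_INDEX):
    """Pad every label sequence to the longest one in the batch."""
    width = max(len(seq) for seq in batch)
    return [seq + [ignore_index] * (width - len(seq)) for seq in batch]

def restore_pad(batch, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Swap the ignore index back to a real token id so the batch can be decoded."""
    return [[pad_id if tok == ignore_index else tok for tok in seq] for seq in batch]

labels = pad_labels([[12, 7, 3], [45, 3]])
print(labels)               # [[12, 7, 3], [45, 3, -100]]
print(restore_pad(labels))  # [[12, 7, 3], [45, 3, 0]]
```

After the swap, decoding with `skip_special_tokens=True` drops the pad tokens, so the decoded references contain only the real text.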


In [None]:
# 6) FINE-TUNE BOTH DIRECTIONS AND PRINT THE BLEU1..4 RESULTS
results = []
for d in DIRECTIONS:
    res = finetune_direction(d, train_en, train_vi, test_en, test_vi)
    results.append(res)

print("\n" + "#"*120)
print("BLEU TEST SUMMARY")
print("#"*120)
for res in results:
    m = res["test_metrics"]
    direction = res["direction"]
    # keys have the form test_bleu1/test_bleu2/...
    print(f"\nDirection: {direction} | saved at: {res['model_dir']}")
    for n in [1,2,3,4]:
        k = f"test_bleu{n}"
        if k in m:
            print(f"  BLEU{n}: {m[k]:.4f}")



Direction: en-vi
Base model: Helsinki-NLP/opus-mt-en-vi




The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


| Epoch | Training Loss | Validation Loss | Bleu1 | Bleu2 | Bleu3 | Bleu4 | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.3285 | 1.223307 | 71.1159 | 61.4069 | 53.481 | 46.9435 | 35.85776 |
| 2 | 1.1826 | 1.114921 | 72.6369 | 63.2505 | 55.5417 | 49.156 | 36.05392 |


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.encoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', 'model.decoder.embed_positions.weight', 'lm_head.weight'].



Test metrics: {'test_bleu1': 71.4351, 'test_bleu2': 61.8089, 'test_bleu3': 53.9215, 'test_bleu4': 47.4181, 'test_gen_len': 36.02033333333333}

Direction: vi-en
Base model: Helsinki-NLP/opus-mt-vi-en


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.



| Epoch | Training Loss | Validation Loss | Bleu1 | Bleu2 | Bleu3 | Bleu4 | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.4531 | 1.396628 | 65.0083 | 52.1817 | 43.3317 | 36.6775 | 34.21096 |
| 2 | 1.2729 | 1.273593 | 66.7273 | 54.2822 | 45.624 | 39.0579 | 34.42176 |


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.encoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', 'model.decoder.embed_positions.weight', 'lm_head.weight'].



Test metrics: {'test_bleu1': 65.0209, 'test_bleu2': 51.8476, 'test_bleu3': 42.7968, 'test_bleu4': 36.0128, 'test_gen_len': 34.952666666666666}

########################################################################################################################
BLEU TEST SUMMARY
########################################################################################################################

Direction: en-vi | saved at: vlsp_marian_en-vi
  BLEU1: 71.4351
  BLEU2: 61.8089
  BLEU3: 53.9215
  BLEU4: 47.4181

Direction: vi-en | saved at: vlsp_marian_vi-en
  BLEU1: 65.0209
  BLEU2: 51.8476
  BLEU3: 42.7968
  BLEU4: 36.0128


In [None]:
# 7) (OPTIONAL) COPY THE MODELS TO GOOGLE DRIVE

import shutil

IN_COLAB = False
try:
    from google.colab import drive  # type: ignore
    drive.mount("/content/drive")
    IN_COLAB = True
except Exception as e:
    print("Not running in Colab, or Drive could not be mounted:", e)

if IN_COLAB:
    DRIVE_DIR = "/content/drive/MyDrive/BTL_NLP_MT"
    os.makedirs(DRIVE_DIR, exist_ok=True)

    def copy_dir(src, dst):
        if os.path.exists(dst):
            shutil.rmtree(dst)
        shutil.copytree(src, dst)

    src1 = "vlsp_marian_en-vi"
    src2 = "vlsp_marian_vi-en"
    dst1 = os.path.join(DRIVE_DIR, src1)
    dst2 = os.path.join(DRIVE_DIR, src2)

    copy_dir(src1, dst1)
    copy_dir(src2, dst2)

    print("Copied:")
    print(" -", dst1)
    print(" -", dst2)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Copied:
 - /content/drive/MyDrive/BTL_NLP_MT/vlsp_marian_en-vi
 - /content/drive/MyDrive/BTL_NLP_MT/vlsp_marian_vi-en
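
`copy_dir` deletes the destination before copying, so the Drive folder ends up as an exact mirror of the local model directory. The sketch below contrasts that with a merge-style copy (`shutil.copytree(..., dirs_exist_ok=True)`, available from Python 3.8), using temporary directories and made-up file names. `mirror_dir` is a hypothetical stand-in for the notebook's `copy_dir`.

```python
import shutil
import tempfile
from pathlib import Path

def mirror_dir(src, dst):
    """Replace dst with an exact copy of src (same idea as copy_dir above)."""
    if Path(dst).exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)

with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "model", Path(tmp) / "backup"
    src.mkdir()
    (src / "config.json").write_text("{}")
    dst.mkdir()
    (dst / "stale-checkpoint.bin").write_text("old")  # left over from an earlier run

    # Merge copy: the stale file survives next to the new one.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    merged = sorted(p.name for p in dst.iterdir())

    # Mirror copy: only the files from src remain.
    mirror_dir(src, dst)
    mirrored = sorted(p.name for p in dst.iterdir())

print(merged)    # ['config.json', 'stale-checkpoint.bin']
print(mirrored)  # ['config.json']
```

The mirror approach avoids stale checkpoint files accumulating on Drive, at the cost of briefly having no copy at the destination while the new one is written.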


In [None]:
# 8) INFERENCE: TRANSLATE IN BOTH DIRECTIONS
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def load_model_for_direction(direction: str, base_dir: str = "."):
    assert direction in ["en-vi", "vi-en"]
    model_dir = os.path.join(base_dir, f"vlsp_marian_{direction}")
    tok = AutoTokenizer.from_pretrained(model_dir)
    mdl = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device)
    mdl.eval()
    return tok, mdl

tok_envi, mdl_envi = load_model_for_direction("en-vi", base_dir=".")
tok_vien, mdl_vien = load_model_for_direction("vi-en", base_dir=".")

@torch.no_grad()
def translate(text: str, direction: str = "en-vi", num_beams: int = 4, max_length: int = 128) -> str:
    if direction == "en-vi":
        tok, mdl = tok_envi, mdl_envi
    else:
        tok, mdl = tok_vien, mdl_vien

    inputs = tok([text], return_tensors="pt", padding=True, truncation=True, max_length=max_length).to(device)
    gen_ids = mdl.generate(**inputs, num_beams=num_beams, max_length=max_length)
    return tok.decode(gen_ids[0], skip_special_tokens=True)

print("EN->VI:", translate("This is a medical sentence about diabetes.", direction="en-vi"))
print("VI->EN:", translate("Đây là một câu y khoa nói về bệnh tiểu đường.", direction="vi-en"))




EN->VI: Đây là một câu hỏi y khoa về bệnh đái tháo đường.
VI->EN: This is a medical question about diabetes.
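
`translate` handles one sentence per call. For longer lists it is usually cheaper to send several sentences through the model at once. The sketch below shows only the batching logic: `translate_batch` is a hypothetical callable (list of strings in, list of strings out), and a fake stand-in is used so the example runs without loading any model.

```python
def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def translate_many(sentences, translate_batch, batch_size=16):
    """Translate a list in fixed-size batches, preserving the input order."""
    out = []
    for batch in chunks(sentences, batch_size):
        out.extend(translate_batch(batch))
    return out

# Stand-in for a real model call so the sketch runs without downloading weights.
def fake_translate(batch):
    return [s.upper() for s in batch]

print(translate_many(["a", "b", "c"], fake_translate, batch_size=2))  # ['A', 'B', 'C']
```

In this notebook, a real `translate_batch` could tokenize the whole batch with `padding=True`, call `generate` once, and decode with `batch_decode`; that wiring is left out here and would need to be checked against the loaded tokenizer and model.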


In [None]:
def count_params(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

def human(n):
    # format as e.g. 12.34M / 1.23B
    for unit in ["", "K", "M", "B", "T"]:
        if n < 1000:
            return f"{n:.2f}{unit}"
        n /= 1000
    return f"{n:.2f}P"

# EN->VI model
total_envi, train_envi = count_params(mdl_envi)
print("=== Marian EN->VI ===")
print("Total params    :", f"{total_envi:,}", f"({human(total_envi)})")
print("Trainable params:", f"{train_envi:,}", f"({human(train_envi)})")

# VI->EN model
total_vien, train_vien = count_params(mdl_vien)
print("\n=== Marian VI->EN ===")
print("Total params    :", f"{total_vien:,}", f"({human(total_vien)})")
print("Trainable params:", f"{train_vien:,}", f"({human(train_vien)})")

print("\n=== Combined (2 models) ===")
print("Total params    :", f"{(total_envi+total_vien):,}", f"({human(total_envi+total_vien)})")
print("Trainable params:", f"{(train_envi+train_vien):,}", f"({human(train_envi+train_vien)})")


=== Marian EN->VI ===
Total params    : 72,149,504 (72.15M)
Trainable params: 71,625,216 (71.63M)

=== Marian VI->EN ===
Total params    : 72,177,152 (72.18M)
Trainable params: 71,652,864 (71.65M)

=== Combined (2 models) ===
Total params    : 144,326,656 (144.33M)
Trainable params: 143,278,080 (143.28M)
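
Two quick checks can be derived from these counts. The gap between total and trainable parameters is 524,288 for both models, which equals 2 × 512 × 512. That is consistent with two frozen position-embedding tables (encoder and decoder) of 512 positions by 512 dimensions, though this reading is an inference from the numbers rather than something the notebook prints. The weight size on disk can also be estimated from the parameter count. `weight_megabytes` is a hypothetical helper, and the estimate ignores file-format overhead.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weight_megabytes(num_params, dtype="fp32"):
    """Approximate size of the raw weights in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

total, trainable = 72_149_504, 71_625_216   # EN->VI counts printed above
frozen = total - trainable
print(frozen)                                      # 524288
print(frozen == 2 * 512 * 512)                     # True
print(round(weight_megabytes(total), 1))           # 288.6
print(round(weight_megabytes(total, "fp16"), 1))   # 144.3
```

At roughly 289 MB per model in fp32, both directions together need under 600 MB of storage for the weights alone.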
