<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/mt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/mt.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

**Machine Translation**

- 📝 SALP chapter 13

**Machine translation (MT)** leverages computers to translate text between languages, focusing on `practical tasks` rather than complex literary translation.

  - **Primary Uses of MT**: MT is widely used for information access, such as translating online instructions, recipes, and articles, and helps bridge the digital divide by making information more accessible to speakers of lower-resourced languages.
  - **Computer-Aided Translation (CAT)**: MT supports human translators by generating draft translations that are refined in a post-editing phase, often as part of localization efforts.
  - **Real-Time Communication and Image Translation**: MT now enables on-the-fly speech translation and image-based translations (e.g., translating text on menus or signs captured by a phone camera).
  - **Encoder-Decoder Network Architecture**: MT relies on encoder-decoder networks to manage language differences, such as word order and grammatical structures, effectively mapping complex input sequences to output sequences.

## Language Divergences and Typology
- **Language Universals**: Despite the diversity of around 7,000 languages, some elements are universal or statistically common across languages, such as words for basic human functions and structures like nouns, verbs, questions, and commands, reflecting language's role as a communicative tool.

- **Linguistic Diversity and Typology**: Languages vary significantly, especially in lexical choices and sentence structure. These differences, studied in linguistic typology, influence machine translation, as understanding both unique and systematic language differences helps improve MT models.

### Word Order Typology
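- **Word Order Variations**: Languages differ in the basic order of subject (S), verb (V), and object (O):
  - SVO (e.g., English),
  - SOV (e.g., Japanese),
  - VSO (e.g., Arabic).
  - Basic word order tends to correlate with other placement choices: VO languages generally use prepositions, whereas OV languages generally use postpositions.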

![word order differences](./images/mt/wo.png)

- **Modifier Placement Differences**: Modifier positions vary by language, with adjectives before nouns in English but after nouns in Spanish, affecting translation structure.

### Lexical Divergences
- **Word Translation Context**: Translating words accurately depends on context, as many words, like "bass" or "wall," have multiple meanings across languages, necessitating disambiguation in machine translation.

- **Grammatical Constraints**: Languages impose different grammatical rules, such as gender and plurality. For instance, translating into French requires specifying adjective gender, which may not be present in English.

- **Complex Mappings and Lexical Gaps**: Some concepts translate differently depending on context (e.g., "leg" as body part vs. journey stage in French). Certain words lack direct equivalents across languages, leading to challenges in conveying precise meanings.

![Word overlap](./images/mt/ol.png)

- **Event Description Differences**: Languages vary in how they describe events, with "verb-framed" languages like Spanish marking direction on the verb, while "satellite-framed" languages like English use particles to indicate direction, impacting translation approaches.

### Morphological Typology
- **Morpheme Use**: Languages range from single-morpheme words (e.g., Vietnamese) to complex words combining many morphemes (e.g., Yupik).

- **Morpheme Boundaries**: Agglutinative languages (e.g., Turkish) have clear morpheme boundaries, while fusional languages (e.g., Russian) combine multiple meanings in one affix. Handling this range of morphology is one reason MT systems rely on subword models.

### Referential density
- **Pronoun Omission**: Some languages (e.g., Spanish, Chinese, Japanese) often omit pronouns ("pro-drop"), requiring listeners to infer the subject, while languages like English use explicit pronouns.

- **Referential Density**: Languages with frequent pronoun omission (e.g., Chinese, Japanese) are "referentially sparse" or "cold," relying on inference, whereas more explicit languages (e.g., English) are "referentially dense" or "hot." Translating between these types can be challenging for maintaining clarity.

## Machine Translation using Encoder-Decoder
- **MT Architecture**: The standard MT model uses an `encoder-decoder transformer` (`sequence-to-sequence`) to map each source-language sentence to a target-language sentence, translating sentences independently of one another.

- **Objective**: MT systems are trained with supervised learning on parallel sentences, maximizing the probability of target tokens $P(y_1, \dots, y_m | x_1, \dots, x_n)$ given source tokens $x_1, \dots, x_n$.

- **Encoder-Decoder Process**: The encoder produces an intermediate context $\mathbf{h} = \text{encoder}(x)$, and the decoder uses $\mathbf{h}$ to generate each output token sequentially, $y_{t+1} = \text{decoder}(\mathbf{h}, y_1, \dots, y_t)$ for $t \in [1, \dots, m]$.
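
The generation loop described above can be sketched in a few lines. This is only an illustration of the interface: `toy_encoder` and `toy_decoder` are hypothetical stand-ins (the "translation" just upper-cases the source tokens), not a real neural model.

```python
from typing import List, Sequence

EOS = "</s>"  # hypothetical end-of-sentence marker


def toy_encoder(source_tokens: Sequence[str]) -> List[str]:
    """Stand-in for encoder(x): the 'context' h is simply the token list."""
    return list(source_tokens)


def toy_decoder(h: List[str], prefix: Sequence[str]) -> str:
    """Stand-in for decoder(h, y_1..y_t): return the next target token."""
    t = len(prefix)
    return h[t].upper() if t < len(h) else EOS


def generate(source_tokens, encoder=toy_encoder, decoder=toy_decoder, max_len=50):
    h = encoder(source_tokens)  # computed once per source sentence
    output = []
    for _ in range(max_len):
        # Each step is conditioned on h and on everything generated so far
        next_token = decoder(h, output)
        if next_token == EOS:
            break
        output.append(next_token)
    return output


print(generate(["the", "green", "witch"]))
```

A real system replaces the two stand-ins with transformer stacks and chooses the next token from a probability distribution rather than deterministically.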

### Tokenization
- **Subword Tokenization**: MT uses shared subword tokenization for source and target languages, enabling translation between languages with different word-separation rules.

- **WordPiece Algorithm**: WordPiece tokenization, used in BERT, builds its vocabulary by repeatedly merging the tokens whose merge most increases the language model probability of the training corpus, until a specified vocabulary size is reached.

- **Unigram (SentencePiece) Algorithm**: Unigram tokenization starts with a large vocabulary and reduces it by removing low-probability tokens, creating more meaningful subwords.

- **Unigram Advantage**: Unigram tokenization captures semantically relevant tokens better than BPE, avoiding overly small or common token fragments.
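
Once a unigram vocabulary has been learned, a word is tokenized by choosing the segmentation with the highest total probability. The sketch below shows only that segmentation step, using Viterbi-style dynamic programming. The vocabulary and its probabilities are made up for illustration and are not normalized; vocabulary learning and pruning are not shown.

```python
import math


def unigram_segment(text, log_probs):
    """Return the most probable segmentation of `text` under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the string to recover the tokens
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Hypothetical vocabulary with illustrative (unnormalized) probabilities
probs = {"un": 0.1, "like": 0.1, "ly": 0.1, "likely": 0.05}
probs.update({ch: 0.01 for ch in "unlikey"})  # single characters as a fallback
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(unigram_segment("unlikely", log_probs))
```

With these numbers, `un` + `likely` scores higher than `un` + `like` + `ly` or any character-level split, so longer, more meaningful pieces win whenever the vocabulary supports them.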

### Creating the Training data
- **Parallel Corpora**: MT models are trained on parallel corpora (bitexts), with large datasets like Europarl, the UN Parallel Corpus, and OpenSubtitles providing millions of sentence pairs in multiple languages.

- **Sentence Alignment** takes sentences $e_1, \dots, e_n$ and $f_1, \dots, f_m$ from two documents that are translations of each other and finds minimal sets of sentences that are translations of each other, including
  - single sentence mappings like $(e_1, f_1), (e_4, f_3), (e_5, f_4), (e_6, f_6)$,
  - 2-1 alignments like $(e_2/e_3, f_2), (e_7/e_8, f_7)$,
  - and null alignments like $(f_5)$.
  - `Sentence Alignment` for new corpora requires a cost function that scores how likely two spans are to be translations, and an alignment algorithm that uses those scores, often dynamic programming in the style of minimum edit distance.

![A sample alignment between sentences in English and French](./images/mt/align.png)

- **Multilingual Embedding**: Sentence similarity is scored using cosine similarity in a multilingual embedding space, with the [cost function](https://aclanthology.org/D19-1136.pdf) helping to align sentence spans.

- **Corpus Cleanup**: Noisy sentence pairs are removed through rules or by ranking pairs based on their multilingual cosine scores to ensure high-quality training data.
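
The alignment idea can be sketched with a small dynamic program. This is a simplified illustration, not the cost function from the linked paper: the step cost is plain `1 - cosine`, a merged span is represented by the mean of its sentence vectors, and `skip_cost` is a hypothetical penalty for null alignments. The toy embeddings are hand-made so that the intended alignment is obvious.

```python
import numpy as np


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def align(src, tgt, skip_cost=0.6):
    """Align sentence embeddings src (n, d) and tgt (m, d).

    Returns a list of (source_indices, target_indices) pairs.
    """
    n, m = len(src), len(tgt)
    moves = [(1, 1), (2, 1), (1, 2), (1, 0), (0, 1)]  # 1-1, 2-1, 1-2, null alignments
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    step = skip_cost  # one side has no counterpart
                else:
                    step = 1.0 - cosine(src[i:ni].mean(axis=0), tgt[j:nj].mean(axis=0))
                if cost[i, j] + step < cost[ni, nj]:
                    cost[ni, nj] = cost[i, j] + step
                    back[(ni, nj)] = (i, j)
    # Trace back from the end of both documents
    pairs, node = [], (n, m)
    while node != (0, 0):
        prev = back[node]
        pairs.append((tuple(range(prev[0], node[0])), tuple(range(prev[1], node[1]))))
        node = prev
    return pairs[::-1]


# Toy embeddings: source sentences 1 and 2 together correspond to target sentence 1
src = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
tgt = np.array([[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]], dtype=float)
alignment = align(src, tgt)
print(alignment)
```

In practice the vectors would come from a multilingual sentence encoder, and the same cosine scores can be reused afterwards to filter out low-scoring pairs during corpus cleanup.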

## Details of the Encoder-Decoder Model
- **Encoder-Decoder Transformer Architecture**: The standard architecture for MT is the encoder-decoder transformer, consisting of an encoder (standard transformer) and a decoder with an additional cross-attention layer to attend to the source language.

![The encoder-decoder transformer architecture for machine translation](./images/mt/de.png)

- **Decoding Process**: The decoder generates target language words one by one, conditioned on the source sentence and previously generated words, using techniques like beam search for decoding.

- **Cross-Attention Layer**: The decoder includes a cross-attention layer where queries come from the previous decoder layer, and keys and values come from the encoder's output, allowing the decoder to focus on source language tokens.

- **Attention Mechanism**: The attention mechanism in the decoder is a mix of cross-attention (to the encoder's output) and causal (left-to-right) multi-head attention, while the encoder’s multi-head attention can look at the entire source text.
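
A single-head version of cross-attention can be written directly from this description. The sketch below uses random matrices in place of learned weights and real hidden states; the names `W_q`, `W_k`, `W_v` and the sizes are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(dec_states, enc_output, W_q, W_k, W_v):
    Q = dec_states @ W_q  # (m, d): queries come from the previous decoder layer
    K = enc_output @ W_k  # (n, d): keys come from the encoder output
    V = enc_output @ W_v  # (n, d): values come from the encoder output
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (m, n) scaled dot products
    weights = softmax(scores, axis=-1)       # each target position attends over the source
    return weights @ V, weights


rng = np.random.default_rng(0)
n, m, d = 5, 3, 8  # source length, target length, model dimension
enc_output = rng.normal(size=(n, d))
dec_states = rng.normal(size=(m, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = cross_attention(dec_states, enc_output, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

No causal mask is applied here because every target position may look at the whole source sentence; the mask belongs to the decoder's self-attention.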

![The transformer block for the encoder and the decoder.](./images/mt/dbblk.png)

- **Training and Loss Function**: The model is trained autoregressively using cross-entropy loss, with teacher forcing where the decoder is given the actual target token from the training data at each time step, not the model’s own prediction.
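
Teacher forcing and the loss can be illustrated with a toy example. The token ids, the start id `0`, and the uniform logits below are hypothetical; the point is only that the decoder inputs are the gold tokens shifted one step to the right, and that the loss is the average negative log-probability of each gold token.

```python
import numpy as np

IGNORE_INDEX = -100  # label value that is skipped in the loss (used for padding)


def shift_right(labels, start_id):
    """Decoder inputs under teacher forcing: the gold tokens shifted right by one."""
    return np.concatenate([[start_id], labels[:-1]])


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the gold tokens, skipping IGNORE_INDEX."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    mask = labels != IGNORE_INDEX
    picked = log_probs[np.arange(len(labels))[mask], labels[mask]]
    return float(-picked.mean())


labels = np.array([4, 2, 7, 1])                   # hypothetical gold target ids
decoder_inputs = shift_right(labels, start_id=0)  # what the decoder sees at each step
vocab_size = 10
uniform_logits = np.zeros((len(labels), vocab_size))  # an untrained, uniform model

print(decoder_inputs)
print(cross_entropy(uniform_logits, labels))
```

A model that assigns equal probability to all 10 tokens gets a loss of `log(10)`; training pushes the loss down by raising the probability of each gold token.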

## Decoding in MT: Beam Search
- **Greedy Decoding Limitation**: Greedy decoding selects the highest-probability word at each timestep, but a locally best choice can lead to a worse overall sequence because earlier decisions are never reconsidered. In the example below, greedy decoding generates `yes yes` even though `ok ok` is the more probable sequence overall.

  ![Greedy Decoding Limitation](./images/mt/greedy.png)

  - Beam search addresses this by keeping multiple hypotheses alive at each step.

- **Beam Search** is a heuristic search method that keeps the k best hypotheses at each timestep, where k is the beam width. A larger k explores more of the search space at the cost of more memory and computation.

![Beam search decoding with a beam width of k = 2.](./images/mt/beam.png)

- **Hypothesis Extension**: At each step, k-best hypotheses are extended by generating all possible next tokens and scoring them based on the probability of the current word and the previous path, pruning to keep only the k-best.

- **Log Probability Scoring**: The score of each hypothesis is computed using the chain rule of probability, where the log probability of the full sequence is the sum of the log probabilities of each word conditioned on previous words.

- **Handling Different Lengths**: Completed hypotheses can have different lengths. Because every additional token adds another negative log probability, raw scores favor shorter sequences, so a length normalization is applied, such as dividing the log probability by the number of words.

- **Decoding Process**: When a hypothesis generates an EOS (End Of Sentence) token, it is a complete translation: it is removed from the frontier and the beam width is reduced by one. The search continues until the beam has been reduced to 0.

- **Final Selection**: The result of beam search is a set of k hypotheses, and the most probable one can be selected for the final translation, or all k hypotheses can be passed to downstream applications.

![Scoring for beam search decoding with a beam width of k = 2.](./images/mt/score.png)

- **Beam Width in MT**: Typical beam widths for machine translation are between 5 and 10, with each width offering a trade-off between computational cost and translation quality.
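
The steps above can be sketched with a toy next-token model. The probability table is made up for illustration (it is not copied from the figures) and is chosen so that greedy decoding (k = 1) ends in `yes yes`, while a beam of width 2 recovers the more probable `ok ok`. Final ranking uses the log probability divided by the number of tokens.

```python
import math

EOS = "</s>"

# Hypothetical next-token probabilities, keyed by the prefix generated so far
TABLE = {
    (): {"ok": 0.4, "yes": 0.5, EOS: 0.1},
    ("ok",): {"ok": 0.7, "yes": 0.2, EOS: 0.1},
    ("yes",): {"ok": 0.3, "yes": 0.4, EOS: 0.3},
}


def toy_next_log_probs(prefix):
    probs = TABLE.get(prefix, {EOS: 1.0})  # after two words the toy model always stops
    return {token: math.log(p) for token, p in probs.items()}


def beam_search(next_log_probs, k, max_len=20):
    frontier = [((), 0.0)]  # (prefix, summed log probability)
    completed = []
    width = k
    for _ in range(max_len):
        if width == 0 or not frontier:
            break
        # Extend every hypothesis on the frontier with every possible next token
        candidates = [
            (prefix + (token,), score + lp)
            for prefix, score in frontier
            for token, lp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        frontier = []
        for prefix, score in candidates[:width]:  # prune to the current beam width
            if prefix[-1] == EOS:
                completed.append((prefix, score))
            else:
                frontier.append((prefix, score))
        width = k - len(completed)  # the beam shrinks as hypotheses finish
    # Rank finished hypotheses by length-normalized log probability
    return sorted(completed, key=lambda c: c[1] / len(c[0]), reverse=True)


print(beam_search(toy_next_log_probs, k=1)[0][0])
print(beam_search(toy_next_log_probs, k=2)[0][0])
```

Hypotheses that have not produced EOS within `max_len` steps are dropped in this sketch; a production decoder would handle that case explicitly.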

### Minimum Bayes Risk Decoding
- **Minimum Bayes Risk (MBR) decoding** chooses the translation with the least expected error, aiming to maximize a `goodness-of-fit metric` (e.g., chrF, BERTScore) rather than just the highest probability translation.

- **Approximating Perfect Translations**: Since the perfect set of translations is unknown, MBR uses a smaller set of candidate translations, selecting the one that is most similar to all others, based on a similarity or alignment function.

- **Application in NLP**: MBR decoding, effective in machine translation, has also been successfully applied to other NLP tasks such as speech recognition, summarization, dialogue systems, and image captioning.
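
A minimal sketch of the candidate-set approximation: each candidate is scored by its average similarity to the other candidates, and the highest-scoring one is returned. The utility used here, an F1 over character bigrams, is a crude stand-in chosen to keep the example self-contained; a real system would plug in a metric such as chrF or BERTScore. The candidate list is hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def overlap_f1(hyp, ref, n=2):
    """F1 over character n-gram matches (a simple stand-in utility function)."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    matches = sum((h & r).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(h.values())
    recall = matches / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates, utility=overlap_f1):
    """Return the candidate with the highest average utility against the others."""
    def expected_utility(index):
        others = [c for j, c in enumerate(candidates) if j != index]
        return sum(utility(candidates[index], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical samples from an MT system (duplicates are common when sampling)
candidates = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat sits on the mat",
    "a dog barked",
]
print(mbr_decode(candidates))
```

The outlier `a dog barked` shares almost nothing with the rest and gets a low expected utility, while the translation that agrees most with the other samples is selected.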

## Translating in low-resource situations
- **Data Scarcity**: Many languages lack large parallel corpora, especially for low-resource domains.
  - **Backtranslation**: Uses monolingual data to generate synthetic parallel text, improving translation for low-resource languages.
    - **Backtranslation Effectiveness**: It works well: training on backtranslated text yields about 2/3 of the gain that training on the same amount of natural bitext would give.
- **Data Quality**: Many parallel corpora for low-resource languages suffer from poor quality due to insufficient native speaker input.
  - **Multilingual Models**: Use multiple language pairs to improve translation for low-resource languages by leveraging related, higher-resource languages.
    - **Multilingual Data Quality**: Large multilingual models can improve translations but often rely on English-centered corpora.
  - **Participatory Design**: Involves native speakers and local experts in developing MT systems for low-resource languages.
  - **Evaluation Methods**: Post-editing MT output is suggested for better error measurement and evaluation in low-resource languages.
  - **Improved MT Models**: New initiatives are expanding multilingual systems to cover more languages and improve translation quality.
- **Socio-Technical Issues**: Low-resource language projects often lack native speaker involvement in content curation and evaluation.
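
The data flow of backtranslation can be sketched as follows. The function `toy_en_to_xx` is a hypothetical stand-in for a trained target-to-source MT system (it merely reverses each word); the point is that the target side of every synthetic pair is genuine human text, while the source side is machine-generated.

```python
def backtranslate(target_monolingual, translate_tgt_to_src):
    """Build synthetic (source, target) pairs from target-side monolingual text."""
    return [(translate_tgt_to_src(sentence), sentence) for sentence in target_monolingual]


def toy_en_to_xx(sentence):
    """Hypothetical target-to-source 'translator' used only for illustration."""
    return " ".join(word[::-1] for word in sentence.split())


natural_bitext = [("olleh", "hello")]              # small amount of real parallel data
monolingual_en = ["good morning", "thank you"]     # plentiful target-language text

synthetic_bitext = backtranslate(monolingual_en, toy_en_to_xx)
training_data = natural_bitext + synthetic_bitext  # train the source-to-target model on both
print(training_data)
```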

## [MT Evaluation](https://machinetranslate.org/metrics)
- MT is evaluated on 
  - **adequacy**: how well the translation conveys the meaning,
  - **fluency**: how natural and grammatically correct the translation is.

- **Using Human Raters to Evaluate MT**  
  - Human raters assess translations based on adequacy and fluency using scales or rankings.
  - Training is necessary for raters to distinguish between fluency and adequacy, and to standardize evaluations.
  - Post-editing translations is another method to evaluate quality, measuring the difference between original MT output and post-edited text.

- **Automatic Evaluation**  
  - **[chrF (character F-score)](https://huggingface.co/spaces/evaluate-metric/chrf)** is a robust metric based on character n-gram overlap between hypothesis and reference, and it often tracks human judgments more closely than word-based overlap metrics.
  - **[BLEU (Bilingual Evaluation Understudy)](https://huggingface.co/spaces/evaluate-metric/bleu)** is another popular word-based overlap metric but has limitations in languages with complex morphology or different tokenization.
  - **Statistical Significance Testing** using methods like the paired bootstrap test helps assess the significance of differences in scores between two systems.
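
The sketch below illustrates both ideas with simplified code. `chrf` is a bare-bones sentence-level version (whitespace removed, precision and recall averaged over n-gram orders, combined with $\beta = 2$) and will not reproduce sacreBLEU's numbers exactly. `paired_bootstrap` resamples the test set with replacement and reports how often system A's average score beats system B's; averaging sentence scores is a simplification. The example sentences and both "systems" are hypothetical.

```python
import random
from collections import Counter


def chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified sentence-level chrF over character 1..max_n-grams."""
    hyp, ref = hyp.replace(" ", ""), ref.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue  # one of the strings is shorter than n characters
        matches = sum((h & r).values())
        precisions.append(matches / sum(h.values()))
        recalls.append(matches / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    if p + rec == 0:
        return 0.0
    return (1 + beta**2) * p * rec / (beta**2 * p + rec)


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A outscores system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # same sentences for both systems
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / num_samples


references = ["the cat sat on the mat", "it is raining today", "she reads a book"]
system_a = ["the cat sat on the mat", "it is raining today", "she reads a book"]
system_b = ["the cat sits on a mat", "it rains today", "she is reading books"]

scores_a = [chrf(h, r) for h, r in zip(system_a, references)]
scores_b = [chrf(h, r) for h, r in zip(system_b, references)]
print(scores_a, scores_b)
print(paired_bootstrap(scores_a, scores_b))
```

If A wins in nearly all resamples, the observed difference is unlikely to be an artifact of which sentences happened to be in the test set.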

- **Automatic Evaluation: Embedding-Based Methods**  
  - Embedding-based metrics like **[BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)** measure translation quality by the cosine similarity between contextual token embeddings of the reference and the candidate.
  
  ![The computation of BERTSCORE recall from reference x and candidate x̂](./images/mt/bertscore.png)
  
  - **[COMET](https://unbabel.github.io/COMET/html/index.html)** and **[BLEURT](https://huggingface.co/spaces/evaluate-metric/bleurt)** are trained on human-labeled datasets to predict translation quality.
  - These embedding-based methods address the issue of synonyms and paraphrasing by considering semantic meaning.
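
The greedy matching behind BERTScore can be sketched with toy vectors. Real BERTScore takes the token embeddings from a pretrained encoder such as BERT; the hand-made embeddings below are assumptions used only to make the arithmetic easy to follow. Recall matches each reference token to its most similar candidate token, and precision does the reverse.

```python
import numpy as np


def bertscore_from_embeddings(ref_emb, cand_emb):
    """Precision, recall and F1 from token embeddings via greedy cosine matching."""
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T                  # (reference tokens, candidate tokens)
    recall = sim.max(axis=1).mean()     # best match for every reference token
    precision = sim.max(axis=0).mean()  # best match for every candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)


# Toy embeddings: the candidate covers two of the three reference tokens
ref_emb = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
cand_emb = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)

precision, recall, f1 = bertscore_from_embeddings(ref_emb, cand_emb)
print(precision, recall, f1)
```

Here every candidate token has a perfect match (precision 1), but one reference token is left uncovered (recall 2/3), giving an F1 of 0.8.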

In [None]:
# Install Huggingface core libraries
!pip install tokenizers transformers datasets accelerate

In [None]:
# https://huggingface.co/learn/nlp-course/chapter7
# 1. Explore a dataset for translating Chinese to English
# https://huggingface.co/datasets/suolyer/translate_zh2en
from datasets import load_dataset

raw_datasets = load_dataset("suolyer/translate_zh2en")

In [None]:
raw_datasets

In [None]:
raw_datasets['train'][1]

In [None]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets

In [None]:
split_datasets["validation"] = split_datasets.pop("test")

In [None]:
# 2. Explore a model for translation
# Use a pipeline as a high-level helper
from transformers import pipeline
# Keep the checkpoint name in one variable and pass it to the pipeline
model = "Helsinki-NLP/opus-mt-zh-en"
translator = pipeline("translation", model=model)
translator('今天是个好日子。')

In [None]:
translator('啥？今天是好日子。鬼才信。')

In [None]:
translator('两岸猿声啼不住，轻舟已过万重山。')

In [None]:
# 3. Process the data
# The Helsinki-NLP organization provides more than a thousand models in multiple languages.
from transformers import AutoTokenizer
model="Helsinki-NLP/opus-mt-zh-en"
tokenizer = AutoTokenizer.from_pretrained(model, return_tensors="pt")

zh_sentence = split_datasets["train"][1]["input"]
en_sentence = split_datasets["train"][1]["output"]

inputs = tokenizer(zh_sentence, text_target=en_sentence)
inputs

In [None]:
zh_sentence, en_sentence

In [None]:
# Wrong tokenization: tokenizing the English sentence with the source-side (Chinese) tokenizer.
# This yields many more tokens because the Chinese vocabulary covers hardly any English words.
wrong_targets = tokenizer(en_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(inputs["labels"]))

In [None]:
# Define the preprocessing function we will apply on the datasets:
max_length = 128
def preprocess_function(examples):
    inputs = examples['input']
    targets = examples['output']
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

In [None]:
split_datasets

In [None]:
# Apply that preprocessing in one go on all the splits of our dataset:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

In [None]:
# Fine-tuning the model with the Trainer API
from transformers import AutoModelForSeq2SeqLM

# `model` held the checkpoint name until now; from here on it is the loaded model
model = AutoModelForSeq2SeqLM.from_pretrained(model)

In [None]:
# Deal with the padding for dynamic batching by data collator
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [None]:
# Test the collator on a few samples
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

In [None]:
# the padding value used to pad the labels should be -100
# not the padding token of the tokenizer,
# to make sure those padded values are ignored in the loss computation.
batch["labels"]

In [None]:
# the decoder input IDs are shifted versions of the labels
batch["decoder_input_ids"]

In [None]:
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])

In [None]:
# 3. Metrics
# - [sacreBLEU](https://github.com/mjpost/sacrebleu)

!pip install sacrebleu evaluate

import evaluate
metric = evaluate.load("sacrebleu")

In [None]:
# A good prediction
# The score can go from 0 to 100, and higher is better.
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

In [None]:
# bad predictions
predictions = ["This This This This"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

In [None]:
predictions = ["This plugin"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

In [None]:
# Convert the model outputs to texts the metric can use
# clean up all the -100s in the labels
# the tokenizer will automatically do the same for the padding token
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

In [None]:
# 4. Fine-tuning the model
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "finetuned-Helsinki-NLP-opus-mt-zh-en",  # output directory
    eval_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)


In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
# evaluate before training to give a baseline
trainer.evaluate(max_length=max_length)

In [None]:
# train
trainer.train()

In [None]:
# evaluate after training to see any improvement
trainer.evaluate(max_length=max_length)

In [None]:
# 5. A custom training loop
# Preparing everything for training

from torch.utils.data import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)


In [None]:
model_checkpoint="Helsinki-NLP/opus-mt-zh-en"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
# The AdamW shipped with transformers is deprecated; use the PyTorch implementation
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [None]:
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save the model and tokenizer locally after each epoch
    output_dir = './'
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        print(f"Training in progress epoch {epoch}")

In [None]:
# 6. Using the fine-tuned model for inference
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("./")
tokenizer = AutoTokenizer.from_pretrained("./")

# Prepare input text for translation
input_text = """
望庐山瀑布
唐·李白
日照香炉生紫烟，
遥看瀑布挂前川。
飞流直下三千尺，
疑是银河落九天。
"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate translation
with torch.no_grad():
    translated_tokens = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=128,
        num_beams=4
    )

# Decode and print the translation
translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(f"Translated text: {translated_text}")
