# NLP 2 Project: Backtranslation for Domain Adaptation

In this project, you will fine-tune a translation model by backtranslating monolingual in-domain text. You will then evaluate its performance in that domain as well as on general-domain data.

Your first task is to compare fine-tuning with backtranslation.
Next, you will explore a method of data selection.
Third, you will extend backtranslation by modifying decoding, changing the model, or using multilingual pivots.
Finally, you will explore your own research question.

This notebook provides starter code to preprocess, fine-tune, and generate with a translation model. This is enough to get you started on the task.
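
For orientation, backtranslation builds synthetic parallel data: in-domain target-language text is translated back into the source language by a reverse (target-to-source) model, and the resulting pairs are used to fine-tune the forward model. The sketch below only shows this data flow. `build_backtranslated_pairs` and `reverse_translate` are hypothetical names, and the lambda is a stand-in for a real reverse model.

```python
from typing import Callable, Dict, List


def build_backtranslated_pairs(
    tgt_sentences: List[str],
    reverse_translate: Callable[[List[str]], List[str]],
    src_lang: str = "en",
    tgt_lang: str = "ru",
) -> Dict[str, List[str]]:
    """Pair genuine target-side text with synthetic (machine-translated) sources."""
    synthetic_src = reverse_translate(tgt_sentences)
    if len(synthetic_src) != len(tgt_sentences):
        raise ValueError("reverse_translate must return one output per input")
    # Columns are keyed by language code, the layout the preprocessing code expects.
    return {src_lang: synthetic_src, tgt_lang: list(tgt_sentences)}


# Stand-in for a real target-to-source model (assumption, for illustration only).
fake_reverse_model = lambda batch: [f"<en translation of: {s}>" for s in batch]

pairs = build_backtranslated_pairs(["Привет, мир!", "Мойте руки."], fake_reverse_model)
print(pairs["en"][0])
```

In practice, the stand-in would be replaced by a call that runs a real target-to-source translation model over the monolingual data, and the returned dictionary could be wrapped with `Dataset.from_dict` before tokenization.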

In [None]:
# set up environment
# if using colab, mount google drive:
# from google.colab import drive
# import os, sys
# drive.mount('/content/drive/')
# nb_path = '/content/notebooks'
# os.symlink('/content/drive/My Drive/Colab Notebooks', nb_path)
# sys.path.insert(0,nb_path)
# !pip install --target=$nb_path ...

!pip install transformers==4.49 datasets evaluate torch sacremoses sacrebleu unbabel-comet

In [None]:
# imports
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset, Dataset, DatasetDict
from evaluate import load
import numpy as np
# import vllm
from tqdm import tqdm


## Preprocessing
First, we need to tokenize our inputs. With HF Transformers, this is fairly simple and is done for you below. Here, we use the model's tokenizer to split the inputs into the model's pre-defined numerical tokens, i.e. convert text into tensors. We also need a function to convert back from tensors into text.
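
To make the idea of truncation, padding, and decoding concrete, here is a toy sketch using a whitespace tokenizer and a made-up five-entry vocabulary. It is not how the model's tokenizer works internally (real tokenizers use subword units and add special tokens), and `encode`, `decode`, and `toy_vocab` are hypothetical names used only for illustration.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"


def encode(text: str, vocab: Dict[str, int], max_length: int) -> List[int]:
    """Map whitespace tokens to IDs, truncate to max_length, then pad on the right."""
    ids = [vocab.get(token, vocab[UNK]) for token in text.split()][:max_length]
    return ids + [vocab[PAD]] * (max_length - len(ids))


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Map IDs back to tokens, dropping padding."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return " ".join(id_to_token[i] for i in ids if i != vocab[PAD])


toy_vocab = {PAD: 0, UNK: 1, "wash": 2, "your": 3, "hands": 4}
ids = encode("wash your hands often", toy_vocab, max_length=6)
print(ids)                     # [2, 3, 4, 1, 0, 0]
print(decode(ids, toy_vocab))  # wash your hands <unk>
```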

In [None]:
# function to tokenize dataset for translation

def preprocess_data(dataset_dict, tokenizer, src_lang, tgt_lang, split, max_length=128):
    """
    Preprocess translation datasets

    Args:
        dataset_dict: Dictionary containing train/dev/test datasets
        tokenizer: Tokenizer object
        src_lang: Source language code
        tgt_lang: Target language code
        split: Dataset split to preprocess ('train', 'validation', etc)
        max_length: Maximum sequence length
    Returns:
        tokenized_dataset: Preprocessed dataset for specified split
    """
    def preprocess_function(examples):
        inputs = examples[src_lang]
        targets = examples[tgt_lang]

        model_inputs = tokenizer(
            inputs,
            max_length=max_length,
            truncation=True,
            padding='max_length',
            return_tensors="pt"
        )

        labels = tokenizer(
            targets,
            max_length=max_length,
            truncation=True,
            padding='max_length',
            return_tensors="pt"
        )

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized_dataset = dataset_dict[split].map(
        preprocess_function,
        batched=True,
        remove_columns=dataset_dict[split].column_names
    )

    return tokenized_dataset

def postprocess_predictions(predictions, labels, tokenizer):
    """
    Convert model outputs to decoded text

    Args:
        predictions: Model predictions
        labels: Ground truth labels
        tokenizer: Tokenizer object
    Returns:
        decoded_preds: Decoded predictions
        decoded_labels: Decoded labels
    """
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    return decoded_preds, decoded_labels



## Evaluation
During fine-tuning, we need to see how good the outputs are on our dev set. For this, we can use the BLEU score (Papineni et al., 2002). The validation function below decodes the predicted token IDs and computes the BLEU score.

On our test sets, we also want to calculate an automatic metric, but on decoded text. We can use BLEU again, but also more advanced metrics like COMET. It's up to you to implement your choice of metric. We will discuss some metrics from the literature in class. It's always good to use at least 2 metrics.
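
As background on what BLEU measures: it combines clipped ("modified") n-gram precisions, typically for n = 1 to 4, with a brevity penalty. The sketch below covers only the clipped precision for a single n, on whitespace tokens and with a single reference, so it is an illustration rather than a replacement for sacrebleu. The helper names `ngrams` and `modified_precision` are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


# Repeating a correct word does not inflate the score: only one "the" is credited.
print(modified_precision("the the the", "the cat", n=1))
```

The clipping step is what stops a degenerate output from scoring well by repeating a common word. For reported results, use an established implementation so that scores are comparable across papers.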

In [None]:
# evaluation: for validation (with raw outputs) and testing (from text)

def compute_metrics_val(tokenizer, eval_preds):
    """
    Calculate BLEU score for predictions

    Args:
        tokenizer: Tokenizer object
        eval_preds: Tuple of predictions and labels
    Returns:
        metrics: Dictionary containing BLEU score
    """
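    preds, labels = eval_preds
    # Some models return a tuple of outputs; keep only the generated token IDs
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds, decoded_labels = postprocess_predictions(preds, labels, tokenizer)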

    # Calculate BLEU score
    bleu = load("sacrebleu")
    results = bleu.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])

    return {"bleu": results["score"]}

def compute_metrics_test(src, tgt, preds, bleu=True, comet=False):
    """
    Calculate automatic metrics on decoded text

    Args:
        src: Source language texts (needed for COMET, unused for BLEU)
        tgt: Target language texts
        preds: Predicted texts
        bleu: Whether to calculate BLEU score
        comet: Whether to calculate COMET score
    Returns:
        metrics: Dictionary mapping metric names to scores
    """
    metrics = {}
    if bleu:
        # Use a separate name so the `bleu` flag is not overwritten
        bleu_metric = load("sacrebleu")
        results = bleu_metric.compute(predictions=preds, references=[[l] for l in tgt])
        metrics["bleu"] = results["score"]
    if comet:
        # Calculate COMET score
        raise NotImplementedError("COMET not implemented yet")

    return metrics

## Fine-tuning
Now that we've tokenized our data and got our evaluation ready, we can start fine-tuning (i.e., training from a pre-trained model). This is a minimal training loop.

We also need to generate at test time from a text dataset. This function involves generation without calculating gradients.

In [None]:
# basic training loop

def train_model(model_name, tokenized_datasets, tokenizer, training_args):
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Verify GPU usage
    if not torch.cuda.is_available():
        print("WARNING: No GPU detected! Training will be slow.")
    else:
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["dev"] if "dev" in tokenized_datasets else None,
        tokenizer=tokenizer,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        compute_metrics=lambda x: compute_metrics_val(tokenizer, x)
    )

    trainer.train()
    return model

# generation (on GPU) for test time
def translate_text(texts, model, tokenizer, max_length=128, batch_size=32):
    """
    Translate texts using the model

    Args:
        texts: List of texts to translate
        model: Translation model
        tokenizer: Tokenizer object
        max_length: Maximum sequence length
        batch_size: Batch size for translation
    Returns:
        translations: List of translated texts
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()
    translations = []

    # Create tqdm progress bar
    progress_bar = tqdm(range(0, len(texts), batch_size), desc="Translating")

    for i in progress_bar:
        batch = texts[i:i + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=max_length,
                do_sample=False,  # deterministic decoding (no sampling)
                early_stopping=True  # only has an effect when beam search is used
            )

        batch_translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(batch_translations)

    return translations


## Final Setup
We now have all the ingredients to run our experiments. This is all standard training code; the interesting results come from what you do with the data. Below, we give an initial setup for getting the code running (either in Colab or on Snellius).

In [None]:

SRC_LANG = "en"
TGT_LANG = "ru"
MODEL_NAME = f"facebook/wmt19-{SRC_LANG}-{TGT_LANG}"
TRAIN_DATASET_NAME = "sethjsa/flores_en_ru"
DEV_DATASET_NAME = "sethjsa/tico_en_ru"
TEST_DATASET_NAME = "sethjsa/tico_en_ru"
OUTPUT_DIR = "./results"

train_dataset = load_dataset(TRAIN_DATASET_NAME)
dev_dataset = load_dataset(DEV_DATASET_NAME)
test_dataset = load_dataset(TEST_DATASET_NAME)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# change the splits for actual training. here, using flores-dev as training set because it's small (<1k examples)
tokenized_train_dataset = preprocess_data(train_dataset, tokenizer, SRC_LANG, TGT_LANG, "dev")
tokenized_dev_dataset = preprocess_data(dev_dataset, tokenizer, SRC_LANG, TGT_LANG, "dev")
tokenized_test_dataset = preprocess_data(test_dataset, tokenizer, SRC_LANG, TGT_LANG, "test")

tokenized_datasets = DatasetDict({
    "train": tokenized_train_dataset,
    "dev": tokenized_dev_dataset,
    "test": tokenized_test_dataset
})

# modify these as you wish; RQ3 could involve testing effects of various hyperparameters
training_args = Seq2SeqTrainingArguments(
    torch_compile=True, # generally speeds up training, try without it to see if it's faster for small datasets
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32, # change batch sizes to fit your GPU memory and train faster
    per_device_eval_batch_size=128,
    weight_decay=0.01,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    save_total_limit=1, # modify this to save more checkpoints
    num_train_epochs=1, # modify this to train more epochs
    predict_with_generate=True,
    generation_num_beams=4,
    generation_max_length=128,
    no_cuda=False,  # Set to False to enable GPU
    fp16=torch.cuda.is_available(),  # Mixed precision for faster training; requires a GPU
)


In [None]:
# fine-tune model
model = train_model(MODEL_NAME, tokenized_datasets, tokenizer, training_args)

# test model
predictions = translate_text(test_dataset["test"][SRC_LANG], model, tokenizer, max_length=128, batch_size=64)
print(predictions)

# COMET is not implemented in compute_metrics_test yet, so report BLEU until you add it
eval_score = compute_metrics_test(test_dataset["test"][SRC_LANG], test_dataset["test"][TGT_LANG], predictions, bleu=True, comet=False)
print(eval_score)

You will find all the datasets for this project under: https://huggingface.co/sethjsa

For other models, consider "Helsinki-NLP/opus-mt-en-ru" (general MT model), "glazzova/translation_en_ru" (tuned on biomedical domain), or "facebook/m2m100_418M" (multilingual model with 100 languages -- consider using for multilingual pivot experiments).
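
One possible reading of a "multilingual pivot" is to route text through an intermediate language by chaining two translation steps. The sketch below shows only that composition. `pivot_translate` is a hypothetical helper, and the two lambdas are stand-ins for real models, so treat it as an illustration of the data flow rather than a prescribed method.

```python
from typing import Callable, List

Translator = Callable[[List[str]], List[str]]


def pivot_translate(
    texts: List[str],
    src_to_pivot: Translator,
    pivot_to_tgt: Translator,
) -> List[str]:
    """Translate via an intermediate (pivot) language by chaining two translators."""
    pivot_texts = src_to_pivot(texts)
    return pivot_to_tgt(pivot_texts)


# Stand-ins for real models (assumption): they only tag the text with the language.
to_pivot = lambda batch: [f"[de] {t}" for t in batch]
to_target = lambda batch: [f"[en] {t}" for t in batch]

print(pivot_translate(["Мойте руки."], to_pivot, to_target))
```

With a multilingual model, both steps could be served by the same model with different language settings. Errors from the first step carry over into the second, which is worth keeping in mind when inspecting the outputs.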

To read more about the WMT Biomedical test data, see here: https://aclanthology.org/2022.wmt-1.69/

## Snellius advice

If you are using Snellius, I recommend converting the above into a training script, which you can then submit as a Slurm job. See the SURF documentation:

- https://servicedesk.surf.nl/wiki/spaces/WIKI/pages/30660217/Creating+and+running+jobs
- https://servicedesk.surf.nl/wiki/spaces/WIKI/pages/30660220/Writing+a+job+script
- https://servicedesk.surf.nl/wiki/spaces/WIKI/pages/30660228/Interacting+with+the+job+queue
- https://servicedesk.surf.nl/wiki/spaces/WIKI/pages/30660234/Example+job+scripts
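
When turning the notebook into a script, it helps to expose the settings you plan to vary as command-line arguments, so that one script can serve many jobs. A minimal sketch with `argparse` follows. The flag names are hypothetical, and the defaults simply mirror the values used in this notebook.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the settings that differ between runs."""
    parser = argparse.ArgumentParser(description="Fine-tune an MT model (sketch)")
    parser.add_argument("--model-name", default="facebook/wmt19-en-ru")
    parser.add_argument("--output-dir", default="./results")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=2e-5)
    return parser.parse_args(argv)


# An explicit list is passed here for demonstration purposes.
args = parse_args(["--epochs", "3", "--output-dir", "./results_bt"])
print(args.model_name, args.epochs, args.output_dir)
```

In an actual script, calling `parse_args()` with no argument reads the flags from the command line, and the parsed values can then be passed on to `Seq2SeqTrainingArguments` and `train_model`.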

# Advanced



ONLY if you have GPU hours left and want to generate backtranslations with an LLM, consider using vLLM for faster generation. An example function is given below.

In [None]:

# if using LLM for generation, consider using vllm for faster generation
def translate_text_vllm(texts, model_name, tokenizer, max_length=128, batch_size=32):
    """
    Translate texts using vllm for faster generation

    Args:
        texts: List of texts (prompts) to translate
        model_name: Name or path of the model (str)
        tokenizer: Name or path of the tokenizer (str), or None to use the model's own
        max_length: Maximum number of tokens to generate per text
        batch_size: Batch size for translation
    Returns:
        translations: List of translated texts
    """
    # Imported here because vllm is optional and must be installed separately
    import vllm

    # vLLM loads the model and tokenizer itself from their names or paths
    llm = vllm.LLM(
        model=model_name,
        tokenizer=tokenizer,
        tensor_parallel_size=1,
        max_num_batched_tokens=max_length * batch_size
    )

    # Create sampling params
    sampling_params = vllm.SamplingParams(
        temperature=0.0,  # Equivalent to greedy decoding
        max_tokens=max_length,
        stop=None
    )

    # Generate translations in batches
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        outputs = llm.generate(batch, sampling_params)

        # Extract generated text from outputs
        batch_translations = [output.outputs[0].text for output in outputs]
        translations.extend(batch_translations)

    return translations
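
Note that a general-purpose LLM continues whatever text it is given, so the `texts` passed to a function like the one above usually need to be wrapped in a translation prompt first. The sketch below shows one way to do that. The helper name `build_translation_prompts` and the template wording are assumptions, and chat-tuned models may instead expect their own chat template.

```python
from typing import List


def build_translation_prompts(texts: List[str], src_name: str, tgt_name: str) -> List[str]:
    """Wrap each input sentence in a simple translation instruction."""
    template = (
        "Translate the following {src} text into {tgt}.\n"
        "{src}: {text}\n"
        "{tgt}:"
    )
    return [template.format(src=src_name, tgt=tgt_name, text=t) for t in texts]


# For backtranslation, the direction is target language to source language.
prompts = build_translation_prompts(["Мойте руки."], "Russian", "English")
print(prompts[0])
```

Because the prompt ends with the target language label, the generated continuation can be taken as the translation, although in practice you may need to cut it off at the first newline.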