**CST8507: Natural Language Processing**

Group Project - Shelly, Steffi

Objectives:
1. Select a field of interest and gather relevant textual data.
2. Develop a system using a pre-trained transformer model.
3. Evaluate and present the system performance with a demo and report. The evaluation must be done according to the metrics explained in your proposal.

**Step 1: Install and Import Required Libraries**

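We use Hugging Face's Transformers library for the model, together with `datasets`, `sacrebleu`, `nltk`, and COMET (`unbabel-comet`) for data handling and evaluation.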

In [1]:
import torch
torch.cuda.empty_cache()


In [2]:
'''Install Dependencies'''
!pip install transformers datasets sentencepiece evaluate sacrebleu
!pip install nltk


Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Downloading colorama-0.4.6-py

In [3]:
'''Import Dependencies'''
from transformers import MarianTokenizer, MarianMTModel, Seq2SeqTrainer, Seq2SeqTrainingArguments
import torch
from datasets import load_dataset, Dataset
import numpy as np
import requests
import zipfile
import os
import json
from tqdm import tqdm
import sacrebleu
!pip install unbabel-comet --quiet

**Step 2: Downloading and Extracting the EA-MT Datasets**

In this cell, we programmatically download the official EA-MT data splits (validation, training, test, and prediction data) directly from the competition website using the Python `requests` library.  
After downloading, we unzip each archive into its own folder and delete the zip file to keep the Colab environment tidy.

This keeps the data organized and ready for the loading and processing steps that follow.
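
The unzip step can also be done without writing the archive to disk first. The following is a minimal sketch, assuming the archive fits in memory; `extract_zip_bytes` is a hypothetical helper that is not part of this notebook, and the demo archive is built in memory instead of being downloaded.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(content: bytes, extract_path: str) -> list:
    """Extract a zip archive held in memory and return its member names."""
    os.makedirs(extract_path, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(content)) as zip_ref:
        zip_ref.extractall(extract_path)
        return zip_ref.namelist()


# Build a tiny archive in memory to stand in for a downloaded response body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("validation/de_DE.jsonl", '{"id": "demo"}\n')

target_dir = tempfile.mkdtemp()
members = extract_zip_bytes(buffer.getvalue(), target_dir)
print(members)
```

In the download loop, the bytes passed in would be the response body (`r.content`), which removes the need to create and delete a temporary zip file.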


In [4]:
# List of (folder_name, url) for all datasets
dataset_links = [
    ("validation", "https://sapienzanlp.github.io/ea-mt/assets/files/semeval.validation.v2-889a1492ba6c3791baa8f4224bc8e685.zip"),
    ("train",      "https://sapienzanlp.github.io/ea-mt/assets/files/semeval.train.v2-e0d1c28b78c8dd4969d25eea5d3bc9cc.zip"),
    ("test",       "https://sapienzanlp.github.io/ea-mt/assets/files/semeval.test_hidden.v2-95720e10dd4cf884e23927f0d5f892f6.zip"),
    ("predictions","https://sapienzanlp.github.io/ea-mt/assets/files/semeval.predictions.v2-348f83c7ccc3ec827a2a3ddbe220278b.zip"),
]

for name, url in dataset_links:
    zip_filename = f"{name}.zip"
    print(f"Downloading {name} data...")
    r = requests.get(url, timeout=60)
    r.raise_for_status()  # stop early if the download failed
    with open(zip_filename, "wb") as f:
        f.write(r.content)
    print(f"Unzipping {zip_filename}...")
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        extract_path = f"./{name}"
        os.makedirs(extract_path, exist_ok=True)
        zip_ref.extractall(extract_path)
    print(f"Removing {zip_filename}...")
    os.remove(zip_filename)
    print(f"{name.capitalize()} data ready in {extract_path}/\n")

Downloading validation data...
Unzipping validation.zip...
Removing validation.zip...
Validation data ready in ./validation/

Downloading train data...
Unzipping train.zip...
Removing train.zip...
Train data ready in ./train/

Downloading test data...
Unzipping test.zip...
Removing test.zip...
Test data ready in ./test/

Downloading predictions data...
Unzipping predictions.zip...
Removing predictions.zip...
Predictions data ready in ./predictions/



**Step 3: Loading JSONL Data Files**

Each data split (train, validation, test) is provided in `.jsonl` (JSON Lines) format, where each line holds a single JSON record.

We define a function that loads any `.jsonl` file into a Python list of dictionaries for easy manipulation and processing.  
We then load the English-to-German data from each folder: `train.jsonl` for training, and `de_DE.jsonl` for validation, test, and the GPT-4o-mini predictions.
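
A JSON Lines file may contain blank or malformed lines, which a plain `json.loads` per line would reject. The sketch below is one possible way to handle that, assuming blank lines should be skipped and bad lines should be reported with their line number; `parse_jsonl` and the two sample records are illustrative and not taken from the dataset.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate empty or trailing lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err
    return records


# Made-up records, for illustration only.
sample_lines = [
    '{"id": "a", "source": "Who wrote Faust?"}\n',
    "\n",
    '{"id": "b", "source": "Where is Berlin?"}\n',
]
records = parse_jsonl(sample_lines)
print(len(records), records[0]["source"])
```

Because an open file object is itself an iterable of lines, the same function can be called as `parse_jsonl(f)` inside a `with open(...)` block.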


In [5]:
import json

def load_jsonl(filepath):
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data

# Adjust the paths based on your extraction folders
train_data = load_jsonl('train/semeval/train/de/train.jsonl')
val_data = load_jsonl('validation/validation/de_DE.jsonl')
test_data = load_jsonl('test/test_without_targets/de_DE.jsonl')  # For test, targets are hidden
prediction_data = load_jsonl('predictions/predictions/gpt-4o-mini-2024-07-18/validation/de_DE.jsonl')  # gpt predictions


# Function to print dataset stats
def print_dataset_stats(data, name="Dataset"):
    print(f"=== {name} ===")
    print(f"Total samples: {len(data)}")
    print("Sample entry:", data[0])
    print("Keys in first entry:", list(data[0].keys()))
    print()

# Assuming you already loaded your datasets as:
# train_data, val_data, test_data
print_dataset_stats(train_data, "Train Data")
print_dataset_stats(val_data, "Validation Data")
print_dataset_stats(test_data, "Test Data")
print_dataset_stats(prediction_data, "Prediction Data")



=== Train Data ===
Total samples: 4087
Sample entry: {'id': 'a9011ddf', 'source_locale': 'en', 'target_locale': 'de', 'source': 'What is the seventh tallest mountain in North America?', 'target': 'Wie heißt der siebthöchste Berg Nordamerikas?', 'entities': ['Q49'], 'from': 'mintaka'}
Keys in first entry: ['id', 'source_locale', 'target_locale', 'source', 'target', 'entities', 'from']

=== Validation Data ===
Total samples: 731
Sample entry: {'id': 'Q100268160_0', 'wikidata_id': 'Q100268160', 'entity_types': ['TV series'], 'source': 'Who played the lead role in The Mole – Undercover in North Korea?', 'targets': [{'translation': 'Wer spielte die Hauptrolle in Der Maulwurf: Undercover in Nordkorea?', 'mention': 'Der Maulwurf: Undercover in Nordkorea'}], 'source_locale': 'en', 'target_locale': 'de'}
Keys in first entry: ['id', 'wikidata_id', 'entity_types', 'source', 'targets', 'source_locale', 'target_locale']

=== Test Data ===
Total samples: 5876
Sample entry: {'id': 'bc577b19fe3bd34e',

**Step 4: Extracting English-German Pairs and Entities/Mentions**

The EA-MT files come in slightly different formats:
- The **training data** uses "source" and "target" fields (single English and German strings), plus "entities" (list of Wikidata IDs).
- The **validation/sample/test data** use a "source" field and a "targets" list (containing "translation" and "mention" fields).
- The **prediction files** use "text" and "prediction" fields for the source sentence and the system translation.

We implement a unified extraction function that detects the data format and builds a standard list of examples, with each example containing the source sentence, target translation (if present), and either the entity IDs or mention text (when available).

This standardization allows us to process all data splits in a consistent manner for training, validation, and testing.
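
To make the target shape concrete, here is a small sketch of the normalization applied to one training record and one validation record, both taken from the sample entries printed in Step 3. It checks the format per record instead of once per file, which is a variation on the approach described here; `normalize_entry` is a hypothetical name.

```python
def normalize_entry(entry):
    """Map one raw record to a common {source, target, ...} shape."""
    if "targets" in entry:  # validation-style record
        first = entry["targets"][0] if entry["targets"] else {}
        return {
            "source": entry.get("source"),
            "target": first.get("translation"),
            "mention": first.get("mention"),
        }
    if "target" in entry:  # training-style record
        return {
            "source": entry.get("source"),
            "target": entry.get("target"),
            "entities": entry.get("entities", []),
        }
    raise ValueError("Unknown data format")


train_entry = {
    "source": "What is the seventh tallest mountain in North America?",
    "target": "Wie heißt der siebthöchste Berg Nordamerikas?",
    "entities": ["Q49"],
}
val_entry = {
    "source": "Who played the lead role in The Mole – Undercover in North Korea?",
    "targets": [
        {
            "translation": "Wer spielte die Hauptrolle in Der Maulwurf: Undercover in Nordkorea?",
            "mention": "Der Maulwurf: Undercover in Nordkorea",
        }
    ],
}

normalized_train = normalize_entry(train_entry)
normalized_val = normalize_entry(val_entry)
print(normalized_train)
print(normalized_val)
```

Both results share the `source` and `target` keys, so downstream code only needs to check for the optional `entities` or `mention` key.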


In [6]:
def detect_and_extract_examples(data):
    # Check the keys of the first item to determine the format
    first = data[0]
    examples = []
    if "targets" in first:  # sample/validation/test
        for entry in data:
            src = entry.get("source")
            tgt = entry["targets"][0].get("translation") if entry.get("targets") else None
            mention = entry["targets"][0].get("mention") if entry.get("targets") else None
            examples.append({"source": src, "target": tgt, "mention": mention})
    elif "target" in first:  # training
        for entry in data:
            src = entry.get("source")
            tgt = entry.get("target")
            entities = entry.get("entities", [])
            examples.append({"source": src, "target": tgt, "entities": entities})
    elif "prediction" in first:  # system predictions (e.g. GPT-4o-mini outputs)
        for entry in data:
            src = entry.get("text")
            tgt = entry.get("prediction")
            entities = entry.get("entities", [])
            examples.append({"source": src, "target": tgt, "entities": entities})
    else:
        raise ValueError("Unknown data format!")
    return examples

examples_train = detect_and_extract_examples(train_data)
examples_val = detect_and_extract_examples(val_data)
examples_test = detect_and_extract_examples(test_data)
examples_test_pred = detect_and_extract_examples(prediction_data)


**Step 5: Exploring Data - Printing Stats and Sample Entries**

To better understand our data, we print:
- The total number of examples in each split.
- The first entry (to verify data loading and schema).
- The first few English-German pairs (with mentions/entities when present).

This exploration helps us spot issues, understand the annotation structure, and ensure we're using the correct fields for training and evaluation.


In [7]:
def print_stats_and_examples(examples, name="Dataset", n=3):
    print(f"=== {name} ===")
    print(f"Total samples: {len(examples)}")
    print("First sample:", examples[0])
    print()
    for i, ex in enumerate(examples[:n]):
        print(f"Example {i+1}:")
        print(f"  EN: {ex['source']}")
        print(f"  DE: {ex.get('target')}")
        if 'mention' in ex:
            print(f"  Mention: {ex['mention']}")
        if 'entities' in ex:
            print(f"  Entities: {ex['entities']}")
        print()

print_stats_and_examples(examples_train, "Train Data")
print_stats_and_examples(examples_val, "Validation Data")
print_stats_and_examples(examples_test, "Test Data")
print_stats_and_examples(examples_test_pred, "GPT Mini Predictions")


=== Train Data ===
Total samples: 4087
First sample: {'source': 'What is the seventh tallest mountain in North America?', 'target': 'Wie heißt der siebthöchste Berg Nordamerikas?', 'entities': ['Q49']}

Example 1:
  EN: What is the seventh tallest mountain in North America?
  DE: Wie heißt der siebthöchste Berg Nordamerikas?
  Entities: ['Q49']

Example 2:
  EN: What year was the first book of the A Song of Ice and Fire series published?
  DE: In welchem Jahr wurde das erste Buch der Reihe "Das Lied von Eis und Feuer" veröffentlicht?
  Entities: ['Q45875']

Example 3:
  EN: Who is the youngest current US governor?
  DE: Wer ist derzeit der jüngste amerikanische Gouverneur?
  Entities: ['Q889821']

=== Validation Data ===
Total samples: 731
First sample: {'source': 'Who played the lead role in The Mole – Undercover in North Korea?', 'target': 'Wer spielte die Hauptrolle in Der Maulwurf: Undercover in Nordkorea?', 'mention': 'Der Maulwurf: Undercover in Nordkorea'}

Example 1:
  EN: Who 

**Step 6: MarianMT Baseline Translation and Evaluation (EN→DE)**

The code below evaluates the pretrained MarianMT model (`Helsinki-NLP/opus-mt-en-de`) as an English-to-German baseline on the validation set of our EA-MT project.

### Key Steps

1. **Model Loading:**  
   Loads the MarianMT model and tokenizer from HuggingFace. The model is used as-is (no fine-tuning) to establish a baseline.

2. **Batch Translation:**  
   Translates the English validation sources to German in batches for efficiency, using the MarianMT model.

3. **COMET Scoring:**  
   Computes COMET scores for the translated outputs using the `Unbabel/wmt22-comet-da` model. COMET is a state-of-the-art reference-based MT evaluation metric that correlates well with human judgments.

4. **Entity-level Evaluation:**  
   For each validation example with a `mention` field, checks by substring match whether the gold mention appears in the prediction, and calculates:
   - **Entity Precision**: Examples where the mention appears in both prediction and reference / all examples where it appears in the prediction
   - **Entity Recall**: Examples where the mention appears in both prediction and reference / all examples where it appears in the reference
   - **Entity F1**: Harmonic mean of precision and recall
   - **Mention Coverage**: Percentage of examples whose mention appears in the prediction
   - **Entity Accuracy**: The same quantity as Mention Coverage, reported as a fraction

5. **BLEU Score:**  
   Also computes the traditional BLEU score for reference.

6. **Saving Results:**  
   All sources, references, predictions, mentions, and COMET scores are saved to a CSV file for further analysis.

---

**This baseline evaluation will allow us to compare future improvements made by fine-tuning or additional techniques.**
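
The substring-based mention check can be illustrated on a toy example. This is a minimal sketch, assuming exact, case-sensitive matching; `mention_coverage` is a hypothetical helper and the three predictions are invented for illustration, not real model outputs.

```python
def mention_coverage(mentions, predictions):
    """Fraction of examples whose gold mention appears verbatim in the prediction."""
    pairs = [(m, p) for m, p in zip(mentions, predictions) if m]
    if not pairs:
        return 0.0
    hits = sum(1 for m, p in pairs if m in p)
    return hits / len(pairs)


mentions = [
    "Der Maulwurf: Undercover in Nordkorea",
    "Das Lied von Eis und Feuer",
    None,  # examples without a mention are skipped
]
# Invented predictions: the first keeps the English title, the second uses the German one.
predictions = [
    "Wer spielte die Hauptrolle in The Mole – Undercover in Nordkorea?",
    "Wann erschien Das Lied von Eis und Feuer?",
    "Wer ist der jüngste Gouverneur?",
]
coverage = mention_coverage(mentions, predictions)
print(f"Mention coverage: {coverage:.2f}")
```

Exact matching is strict: a prediction that differs from the gold mention only in punctuation or casing counts as a miss, which is worth keeping in mind when reading the scores.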


In [25]:
# --- 3. Load MarianMT Pretrained Model (EN->DE) ---
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# --- 4. Batched Translation Function ---
def translate_batch(texts, batch_size=16):
    translations = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Translating"):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)
        with torch.no_grad():
            translated = model.generate(**inputs, max_length=128, num_beams=4)
        outputs = tokenizer.batch_decode(translated, skip_special_tokens=True)
        translations.extend(outputs)
    return translations

# --- 5. Prepare Data for Evaluation ---
sources = [ex["source"] for ex in examples_val]
refs = [ex["target"] for ex in examples_val]
mentions = [ex.get("mention") for ex in examples_val]  # Could be None

# --- 6. Translate Validation Set ---
preds = translate_batch(sources, batch_size=8)



Translating: 100%|██████████| 92/92 [00:18<00:00,  4.93it/s]


In [26]:
# --- 7. Calculate COMET Score ---
# Use the "Unbabel/wmt22-comet-da" COMET model (no registration needed)
from comet import download_model, load_from_checkpoint

comet_model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(comet_model_path)

comet_data = [{"src": src, "mt": mt, "ref": ref} for src, mt, ref in zip(sources, preds, refs)]
comet_scores = comet_model.predict(comet_data, batch_size=16, gpus=1 if device == "cuda" else 0)

print(f"\nAverage COMET score: {np.mean(comet_scores['scores']):.4f}")

# --- 8. Entity-level Metrics (Precision, Recall, F1, Coverage) ---
def entity_metrics(mentions, refs, preds):
    # Only for examples where 'mention' exists (non-empty)
    has_mention = 0
    covered = 0
    tp, fp, fn = 0, 0, 0  # For F1 calculation
    for mention, ref, pred in zip(mentions, refs, preds):
        if not mention:
            continue  # Skip if no mention
        has_mention += 1
        # Coverage: is the mention present in the prediction?
        if mention in pred:
            covered += 1
        # Entity-level counts, using substring match for simplicity
        # True positive: mention found in both prediction and reference
        tp += int((mention in pred) and (mention in ref))
        # False positive: mention found in prediction but not in reference
        fp += int((mention in pred) and (mention not in ref))
        # False negative: mention found in reference but missing from prediction
        fn += int((mention not in pred) and (mention in ref))
    # Compute metrics
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2*prec*rec/(prec+rec) if (prec+rec) > 0 else 0
    coverage = covered / has_mention * 100 if has_mention > 0 else 0
    entity_acc = covered / has_mention if has_mention > 0 else 0
    return prec, rec, f1, coverage, entity_acc, has_mention

prec, rec, f1, coverage, entity_acc, count = entity_metrics(mentions, refs, preds)

print(f"\nEntity-level metrics (on {count} examples with mentions):")
print(f"  Precision:        {prec:.4f}")
print(f"  Recall:           {rec:.4f}")
print(f"  F1 Score:         {f1:.4f}")
print(f"  Mention Coverage: {coverage:.2f}%")
print(f"  Entity Accuracy:  {entity_acc:.4f}")

# --- 9. BLEU Score for Reference ---
from nltk.translate.bleu_score import corpus_bleu

bleu = corpus_bleu([[ref.split()] for ref in refs], [pred.split() for pred in preds])
print(f"\nBLEU score: {bleu:.4f}")

# --- 10. Save Results ---
import pandas as pd

df = pd.DataFrame({
    "source": sources,
    "reference": refs,
    "prediction": preds,
    "mention": mentions,
    "comet_score": comet_scores['scores']
})
df.to_csv("marianmt_val_results.csv", index=False)
print("\nSaved results to marianmt_val_results.csv")




Average COMET score: 0.8673

Entity-level metrics (on 731 examples with mentions):
  Precision:        1.0000
  Recall:           0.1950
  F1 Score:         0.3264
  Mention Coverage: 19.29%
  Entity Accuracy:  0.1929

BLEU score: 0.4245

Saved results to marianmt_val_results.csv


**Step 7: Fine-Tuning the Base Model on Our Dataset and Comparing Evaluation Results**

**Improving Entity Translation**

To improve entity handling, we explicitly mark named entities in both the source and target sentences with special tags (`[ENT]` and `[/ENT]`). This helps the model learn the boundaries of important entities and copy or translate them accurately. We then fine-tune the translation model on this marked data, increasing the chance that entities are preserved in the output.
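
The tagging is a plain string replacement, and the tags can be removed again the same way. The sketch below shows both directions; `strip_entity_tags` is a hypothetical helper that this notebook does not define, included on the assumption that tagged text may need to be cleaned before it is shown or scored.

```python
import re

ENT_TAG = re.compile(r"\[/?ENT\]")  # matches both [ENT] and [/ENT]


def mark_mention(text, mention):
    """Wrap every occurrence of `mention` in [ENT]...[/ENT] tags."""
    if mention and mention in text:
        return text.replace(mention, f"[ENT]{mention}[/ENT]")
    return text


def strip_entity_tags(text):
    """Remove [ENT] and [/ENT] tags, leaving the entity text in place."""
    return ENT_TAG.sub("", text)


sentence = "Wer spielte die Hauptrolle in Der Maulwurf: Undercover in Nordkorea?"
mention = "Der Maulwurf: Undercover in Nordkorea"

marked = mark_mention(sentence, mention)
restored = strip_entity_tags(marked)
print(marked)
print(restored)
```

Stripping the tags restores the original sentence exactly, so tagged and untagged text can be compared on equal terms.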


In [8]:
def mark_entities_in_text(text, mention):
    if mention and mention in text:
        return text.replace(mention, f'[ENT]{mention}[/ENT]')
    else:
        return text

# Flatten validation for all (source, target, mention) triples
val_pairs_marked = []
for entry in val_data:
    src = entry.get("source")
    for t in entry.get("targets", []):
        tgt = t.get("translation")
        mention = t.get("mention")
        marked_src = mark_entities_in_text(src, mention)    # (optional, mention usually in DE)
        marked_tgt = mark_entities_in_text(tgt, mention)
        val_pairs_marked.append({"en": marked_src, "de": marked_tgt, "mention": mention})


**Mapping Wikidata IDs to Entity Names**

To mark entities in our training data, we query the Wikidata API for German and English labels corresponding to each entity ID.
We then replace occurrences of these labels in the source and target sentences with special [ENT] tags, so the model learns to recognize and preserve entities during translation.
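
The label lookup has two parts: grouping IDs into batches for the API, and reading labels out of the JSON response. The following sketch shows both on a hand-written stand-in for a `wbgetentities` response, so it runs without network access. The placeholder IDs and the entity `Q999999999` are hypothetical; only the `Q49` label comes from this notebook's output.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def labels_from_response(payload, lang):
    """Collect {qid: label} for every entity that has a label in `lang`."""
    labels = {}
    for qid, entity in payload.get("entities", {}).items():
        value = entity.get("labels", {}).get(lang, {}).get("value")
        if value:
            labels[qid] = value
    return labels


# 1287 placeholder IDs, matching the number of unique IDs in the train set.
placeholder_ids = [f"Q{i}" for i in range(1, 1288)]
id_batches = ["|".join(chunk) for chunk in chunked(placeholder_ids, 50)]
print(f"Requests needed: {len(id_batches)}")

# Hand-written stand-in for an API response; Q999999999 is a made-up ID with no label.
mock_payload = {
    "entities": {
        "Q49": {"id": "Q49", "labels": {"de": {"language": "de", "value": "Nordamerika"}}},
        "Q999999999": {"id": "Q999999999", "labels": {}},
    }
}
labels = labels_from_response(mock_payload, "de")
print(labels)
```

Entities without a label in the requested language are simply left out of the result, which is why fewer labels than IDs can come back.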


In [9]:
# Collect all unique Wikidata entity IDs from the train set
all_ids = set()
for ex in train_data:
    all_ids.update(ex.get("entities", []))
print(f"Unique Wikidata IDs: {len(all_ids)}")


Unique Wikidata IDs: 1287


In [12]:
import time
def get_wikidata_labels(qids, lang='de'):
    labels = {}
    batch_size = 50
    qids = list(qids)
    for i in range(0, len(qids), batch_size):
        ids = '|'.join(qids[i:i+batch_size])
        url = f"https://www.wikidata.org/w/api.php?action=wbgetentities&ids={ids}&props=labels&languages={lang}&format=json"
        resp = requests.get(url)
        if resp.status_code == 429:
            print(f"Rate limited at batch {i//batch_size+1}, sleeping for 10 seconds and retrying...")
            time.sleep(10)
            resp = requests.get(url)
        if resp.status_code != 200:
            print(f"Failed batch {i//batch_size+1}, status {resp.status_code}. Skipping these IDs.")
            continue
        try:
            data = resp.json()
        except Exception as e:
            print(f"JSON decode error for IDs {ids[:20]}... Skipping batch.")
            continue
        for qid in qids[i:i+batch_size]:
            entity = data.get('entities', {}).get(qid, {})
            label = entity.get('labels', {}).get(lang, {}).get('value')
            if label:
                labels[qid] = label
        time.sleep(2)  # Try 2–5 seconds to be polite and avoid 429s
    return labels

entity_labels_de = get_wikidata_labels(all_ids, lang='de')
entity_labels_en = get_wikidata_labels(all_ids, lang='en')
print("Sample label (de):", next(iter(entity_labels_de.items())))
print(f"Number of GERMAN labels fetched: {len(entity_labels_de)}")




Sample label (de): ('Q12544', 'Byzantinisches Reich')
Number of GERMAN labels fetched: 1284


In [13]:

with open("wikidata_labels_de.json", "w", encoding="utf-8") as f:
    json.dump(entity_labels_de, f, ensure_ascii=False, indent=2)

def mark_entities_by_qid(text, entities, entity_labels):
    for qid in entities:
        label = entity_labels.get(qid)
        if label and label in text:
            text = text.replace(label, f"[ENT]{label}[/ENT]")
    return text

train_pairs_marked = []
for ex in train_data:
    src = mark_entities_by_qid(ex["source"], ex.get("entities", []), entity_labels_en)
    tgt = mark_entities_by_qid(ex["target"], ex.get("entities", []), entity_labels_de)
    train_pairs_marked.append({"en": src, "de": tgt})

for i in range(3):
    print("EN:", train_pairs_marked[i]['en'])
    print("DE:", train_pairs_marked[i]['de'])
    print()



EN: What is the seventh tallest mountain in [ENT]North America[/ENT]?
DE: Wie heißt der siebthöchste Berg [ENT]Nordamerika[/ENT]s?

EN: What year was the first book of the [ENT]A Song of Ice and Fire[/ENT] series published?
DE: In welchem Jahr wurde das erste Buch der Reihe "[ENT]Das Lied von Eis und Feuer[/ENT]" veröffentlicht?

EN: Who is the youngest current US [ENT]governor[/ENT]?
DE: Wer ist derzeit der jüngste amerikanische [ENT]Gouverneur[/ENT]?



**Step 8: Preparing Hugging Face Datasets**

For model training, we need parallel English and German sentences.
We wrap the marked pairs in Hugging Face `Dataset` objects, which are compatible with the Transformers library; the validation set also keeps the `mention` field for entity-level evaluation.

This enables efficient data processing, batching, and tokenization in the next steps.
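
Before wrapping, pairs with a missing side can be filtered out. This is a sketch under the assumption that some pairs might lack a source or target; `to_translation_records` is a hypothetical helper and the toy pairs are invented. The result is a plain list of dictionaries, the shape that `Dataset.from_list` accepts.

```python
def to_translation_records(pairs):
    """Keep pairs where both sides are non-empty and wrap them as translation records."""
    records = []
    for pair in pairs:
        if not pair.get("en") or not pair.get("de"):
            continue  # skip pairs with a missing source or target
        record = {"translation": {"en": pair["en"], "de": pair["de"]}}
        if "mention" in pair:
            record["mention"] = pair["mention"]
        records.append(record)
    return records


# Invented pairs: only the first one is complete.
toy_pairs = [
    {"en": "Where is Berlin?", "de": "Wo liegt Berlin?", "mention": "Berlin"},
    {"en": "Who wrote Faust?", "de": None},
    {"en": "", "de": "Wer ist das?"},
]
toy_records = to_translation_records(toy_pairs)
print(toy_records)
```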


In [14]:

# For train data (entities are optional for explainability tasks)
train_hf = Dataset.from_list([{"translation": {"en": p["en"], "de": p["de"]}} for p in train_pairs_marked])
val_hf = Dataset.from_list([{"translation": {"en": p["en"], "de": p["de"]}, "mention": p["mention"]} for p in val_pairs_marked])

train_hf[0]
val_hf[0]


{'translation': {'de': 'Wer spielte die Hauptrolle in [ENT]Der Maulwurf: Undercover in Nordkorea[/ENT]?',
  'en': 'Who played the lead role in The Mole – Undercover in North Korea?'},
 'mention': 'Der Maulwurf: Undercover in Nordkorea'}

**Step 9: Tokenizing the Datasets**

We load the pre-trained MarianMT English-to-German translation model and its corresponding tokenizer from Hugging Face.
Next, we define a function to tokenize our English (source) and German (target) sentences, padding and truncating them as needed for efficient GPU computation.

We then use the Hugging Face Datasets API to apply our tokenization function to all training and validation examples.
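
When labels are padded to a fixed length, the padding positions are normally excluded from the loss by replacing them with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows that replacement on plain lists; the token IDs and the `pad_id` value are made up for illustration, and in practice the ID would be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding positions in a label sequence with IGNORE_INDEX."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


pad_id = 58100  # assumed value for illustration; use tokenizer.pad_token_id in practice
padded_labels = [120, 87, 0, pad_id, pad_id]  # made-up token IDs followed by padding
masked_labels = mask_padding(padded_labels, pad_id)
print(masked_labels)
```

Without this masking, padding tokens count toward the loss, which tends to make the reported loss look lower than it is on the real tokens.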



In [15]:
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def preprocess_function(batch):
    # `text_target` tokenizes the German side with the target vocabulary and stores
    # it under "labels" (replaces the deprecated `as_target_tokenizer()` context manager)
    inputs = tokenizer(
        batch["translation"]["en"],
        text_target=batch["translation"]["de"],
        padding="max_length", truncation=True, max_length=64,
    )
    return inputs

tokenized_train = train_hf.map(preprocess_function)
tokenized_val = val_hf.map(preprocess_function)



Map:   0%|          | 0/4087 [00:00<?, ? examples/s]



Map:   0%|          | 0/1260 [00:00<?, ? examples/s]

**Step 10: Fine-Tuning the Translation Model**

In this step, we fine-tune the MarianMT model on our custom English-German training pairs.
We set up the training parameters (batch size, epochs, learning rate, evaluation frequency, etc.) and train the model using the Hugging Face `Seq2SeqTrainer` class.

Evaluating on the provided validation set after every epoch lets us monitor the model's performance and spot overfitting, which shows up as validation loss rising while training loss keeps falling.
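
The number of optimizer steps follows directly from the dataset size and the training parameters. A short worked calculation, assuming a single device and no gradient accumulation:

```python
import math

num_train_examples = 4087  # size of the training set
batch_size = 4             # per_device_train_batch_size
num_epochs = 3             # num_train_epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

The total agrees with the `global_step` value reported by `trainer.train()`.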


In [16]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    learning_rate=3e-5,
    save_total_limit=1,
    eval_strategy="epoch",
    predict_with_generate=True,
    report_to="none",  # disables wandb and other reporting
    fp16=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
)

trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2103        | 0.396557        |
| 2     | 0.1317        | 0.409074        |
| 3     | 0.0976        | 0.410634        |




TrainOutput(global_step=3066, training_loss=0.1600810699189168, metrics={'train_runtime': 229.5597, 'train_samples_per_second': 53.411, 'train_steps_per_second': 13.356, 'total_flos': 207813926191104.0, 'train_loss': 0.1600810699189168, 'epoch': 3.0})

**Step 11: Model Evaluation**

**Generating Batched Validation Set Predictions**

We extract all English sentences from the validation data and use our fine-tuned MarianMT model to translate them into German.
To avoid running out of memory in Colab, we translate in small batches (batch size 4 here); this can be increased if memory allows.
The translations are then saved one per line to a file in the next cell for further evaluation.
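
The batching itself is independent of the model. The sketch below isolates it with placeholder sentences, to show how many batches each batch size produces; `iter_batches` is a hypothetical helper.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder sentences standing in for the 1260 flattened validation pairs.
sentences = [f"sentence {i}" for i in range(1260)]
batches_of_4 = list(iter_batches(sentences, 4))

# The baseline run translated 731 sources with a batch size of 8.
baseline_batches = list(iter_batches(sentences[:731], 8))

print(len(batches_of_4), len(baseline_batches), len(baseline_batches[-1]))
```

Larger batches mean fewer `generate` calls but more memory per call, so the batch size is a trade-off between speed and memory.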



In [17]:
def batched_translate(sources, model, tokenizer, batch_size=8):
    preds = []
    model.eval()
    for i in range(0, len(sources), batch_size):
        batch = sources[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=64)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model.generate(**inputs)
        batch_preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        preds.extend(batch_preds)
    return preds

val_sources = [p["en"] for p in val_pairs_marked]
val_targets = [p["de"] for p in val_pairs_marked]
val_mentions = [p["mention"] for p in val_pairs_marked]

val_preds = batched_translate(val_sources, model, tokenizer, batch_size=4)


**COMET EVALUATION**

For the EA-MT shared task, we evaluate translation quality with COMET, a neural MT metric.
We save our predictions, references, and sources as plain text files, one sentence per line, then load them back and score them with the `Unbabel/wmt22-comet-da` model.
COMET returns a score for each segment and a system-level score for the whole set.
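
Line-based files only stay aligned if every sentence occupies exactly one line. The sketch below shows one way to guard against that, assuming that collapsing internal whitespace is acceptable for evaluation; `to_single_line` and `check_aligned` are hypothetical helpers and the example strings are invented.

```python
def to_single_line(text):
    """Collapse newlines and repeated whitespace so the text fits on one line."""
    return " ".join(text.split())


def check_aligned(*columns):
    """Return the shared length of all columns, or raise if they differ."""
    lengths = {len(column) for column in columns}
    if len(lengths) != 1:
        raise ValueError(f"Columns are not aligned: lengths {sorted(lengths)}")
    return lengths.pop()


# Invented examples; the second prediction contains a stray newline.
sources = ["Who wrote Faust?", "Where is Berlin?"]
outputs = ["Wer schrieb Faust?", "Wo liegt\nBerlin?  "]
references = ["Wer hat Faust geschrieben?", "Wo liegt Berlin?"]

clean_outputs = [to_single_line(o) for o in outputs]
num_segments = check_aligned(sources, clean_outputs, references)
print(num_segments, clean_outputs[1])
```

Running the length check after reading the three files back catches the case where an embedded newline has shifted every later line by one.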

In [18]:
# Save system output (your model's translations)
with open("system_output.txt", "w", encoding="utf-8") as f:
    for p in val_preds:  # or test_preds
        f.write(p.strip() + "\n")

# Save reference translations (gold german)
with open("reference.txt", "w", encoding="utf-8") as f:
    for r in val_targets:
        f.write(r.strip() + "\n")

# Save source sentences (English)
with open("source.txt", "w", encoding="utf-8") as f:
    for s in val_sources:
        f.write(s.strip() + "\n")


In [19]:
from comet import download_model, load_from_checkpoint

# Download and load the recommended COMET model
model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path) # Renamed the variable from model to comet_model

# Read data
with open("source.txt", "r", encoding="utf-8") as f:
    src = [line.strip() for line in f]
with open("system_output.txt", "r", encoding="utf-8") as f:
    mt = [line.strip() for line in f]
with open("reference.txt", "r", encoding="utf-8") as f:
    ref = [line.strip() for line in f]

# Prepare data for COMET
data = [{"src": s, "mt": m, "ref": r} for s, m, r in zip(src, mt, ref)]

# Run COMET
model_output = comet_model.predict(data, batch_size=8, gpus=1 if torch.cuda.is_available() else 0)
print("COMET Score:", model_output["system_score"])


COMET Score: 0.8248204116546919
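
The scoring cell pairs the three files with `zip`, which silently drops any unmatched tail if the files differ in length. A minimal sketch of a guard that could run before `predict` is shown below; the helper name `build_comet_inputs` and the toy sentences are assumptions for illustration, not part of the project code.

```python
def build_comet_inputs(src, mt, ref):
    """Pair up source, system output, and reference lines for COMET.

    Raises ValueError if the three lists differ in length, since zip()
    would otherwise drop the unmatched tail without warning.
    """
    if not (len(src) == len(mt) == len(ref)):
        raise ValueError(
            f"Line counts differ: src={len(src)}, mt={len(mt)}, ref={len(ref)}"
        )
    return [{"src": s, "mt": m, "ref": r} for s, m, r in zip(src, mt, ref)]


# Toy example with made-up sentences (not taken from the project data).
example = build_comet_inputs(
    ["Who created the seal?"],
    ["Wer hat das Siegel geschaffen?"],
    ["Wer schuf das Siegel?"],
)
```

The returned list has the same shape as the `data` variable in the cell above, so it could be passed to `comet_model.predict` unchanged.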


**Entity-Centric Evaluation Metrics**

Beyond M-ETA, we can use standard NER evaluation metrics to assess how well our system translates and preserves named entities. These include:
- **Mention Coverage**: The percentage of gold entity mentions that appear in the system outputs.
- **Precision, Recall, F1 (Entity-Level)**: How accurately the system's outputs contain the same named entities as the reference.
- **Entity-Level Exact Match**: The proportion of sentences whose entities in the system output exactly match the gold entities.

These metrics offer insight into the explainability and real-world utility of the translation system from an NER perspective.
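
As a rough illustration of the three metrics listed above, the sketch below computes them from per-sentence entity sets. It assumes that entity sets for both the system output and the reference are already available (for example from an NER tagger); the function name `entity_metrics` and the toy entities are hypothetical. Under this set-based formulation, mention coverage coincides with recall.

```python
def entity_metrics(pred_entities, gold_entities):
    """Compute entity-level scores from per-sentence entity sets.

    pred_entities / gold_entities: lists of sets of entity strings,
    one set per sentence (hypothetical inputs, not the notebook's variables).
    """
    tp = fp = fn = exact = 0
    for pred, gold in zip(pred_entities, gold_entities):
        tp += len(pred & gold)      # entities present in both sets
        fp += len(pred - gold)      # predicted entities absent from gold
        fn += len(gold - pred)      # gold entities the system missed
        exact += int(pred == gold)  # sentence-level exact match

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    n = len(gold_entities)
    return {
        "coverage": recall,  # share of gold mentions recovered
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match": exact / n if n else 0.0,
    }


# Toy data (made up): one spurious and one missed entity in sentence 1.
toy_pred = [{"Berlin", "Angela Merkel"}, {"Paris"}]
toy_gold = [{"Berlin", "Rhein"}, {"Paris"}]
scores = entity_metrics(toy_pred, toy_gold)
```

Unlike a pure "is the gold mention in the output" check, this variant counts false positives, so precision can fall below 1.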


In [21]:
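# Entity-level counts. A "hit" means the gold mention occurs verbatim in the
# predicted translation (a substring match, assuming both are strings).
tp = fp = fn = 0
correct = 0
total = 0

for pred, mention in zip(val_preds, val_mentions):
    if mention:  # skip examples without a gold mention
        total += 1
        if mention in pred:
            correct += 1
            tp += 1
        else:
            fn += 1

# Note: fp is never incremented, because this check cannot detect spurious
# entities. Precision is therefore 1.00 by construction whenever tp > 0, and
# recall equals the entity translation accuracy.
eta = correct / total if total else 0
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall    = tp / (tp + fn) if (tp + fn) > 0 else 0
f1        = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(f"Entity Translation Accuracy: {eta*100:.2f}% ({correct}/{total})")
print(f"Entity-level Precision: {precision:.2f}")
print(f"Entity-level Recall:    {recall:.2f}")
print(f"Entity-level F1:        {f1:.2f}")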


Entity Translation Accuracy: 13.97% (176/1260)
Entity-level Precision: 1.00
Entity-level Recall:    0.14
Entity-level F1:        0.25
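
The reported figures follow directly from the counts. Because `fp` stays at 0 in the cell above, precision is 1 by construction, and the F1 score is determined entirely by recall:

```latex
\begin{aligned}
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{176}{176 + 1084} = \frac{176}{1260} \approx 0.1397 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{176}{176 + 0} = 1 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2 \cdot 1 \cdot 0.1397}{1 + 0.1397} \approx 0.2451
\end{aligned}
```

Rounded to two decimals, these give the printed values 0.14 and 0.25.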


**STEP 12: Model Inference on Test Data**

We use our fine-tuned MarianMT model to translate a list of English test sentences to German, processing inputs in batches to avoid memory issues. The outputs are printed and saved for further evaluation or submission.


In [23]:
# Extract English sentences from your test data (adjust key if needed)
test_sources = [ex["source"] for ex in test_data if ex.get("source")]


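def batched_inference(sources, model, tokenizer, batch_size=2, max_length=64):
    """Translate `sources` in batches and return the decoded predictions.

    `max_length` only truncates the tokenized inputs; the output length is
    governed by the model's generation defaults.
    """
    preds = []
    model.eval()  # switch off dropout for inference
    for i in range(0, len(sources), batch_size):
        batch = sources[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
        # Move the input tensors to the same device as the model
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():  # no gradients are needed at inference time
            outputs = model.generate(**inputs)
        batch_preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        preds.extend(batch_preds)
    return preds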

# Run inference
test_preds = batched_inference(test_sources, model, tokenizer, batch_size=8, max_length=64)


In [24]:
# Print a few predictions
for src, pred in zip(test_sources[:5], test_preds[:5]):
    print("EN:", src)
    print("DE:", pred)
    print()

# Save to file for submission or analysis
with open("test_predictions.txt", "w", encoding="utf-8") as f:
    for p in test_preds:
        f.write(p.strip() + "\n")


EN: Who directed American Murder: The Family Next Door?
DE: Wer führte Regie bei American Murder: The Family Next Door?

EN: When was the movie American Murder: The Family Next Door released?
DE: Wann wurde der Film American Murder: The Family Next Door veröffentlicht?

EN: Is American Murder: The Family Next Door based on a true story?
DE: Basiert American Murder: The Family Next Door auf einer wahren Geschichte?

EN: Where is the Seal of the Confederate States currently displayed?
DE: Wo wird derzeit das Siegel der Konföderierten Staaten dargestellt?

EN: Who created the Seal of the Confederate States?
DE: Wer hat das Siegel der Konföderierten Staaten geschaffen?
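
Line-aligned files such as `test_predictions.txt` are only useful for later scoring if each prediction occupies exactly one line, and `strip()` does not remove newlines inside a prediction. The sketch below shows one way to enforce this and to verify the line count afterwards; the helper names and toy strings are assumptions for illustration. Note that this variant collapses every run of whitespace to a single space.

```python
import os
import tempfile


def write_one_per_line(preds, path):
    """Write predictions one per line, collapsing internal whitespace."""
    with open(path, "w", encoding="utf-8") as f:
        for p in preds:
            # split() + join() removes newlines and repeated spaces
            f.write(" ".join(p.split()) + "\n")


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Toy predictions (made up); the second one contains an internal newline.
toy_preds = ["Wer führte Regie?", "Erste Zeile\nzweite Zeile"]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_predictions.txt")
    write_one_per_line(toy_preds, toy_path)
    n_lines = count_lines(toy_path)

# The file should contain exactly one line per prediction.
assert n_lines == len(toy_preds)
```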

