<a href="https://colab.research.google.com/github/r-m-steffi/EAMT_BART_MARIAN/blob/main/Bart_semeval_EAMT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Problem Statement: The task is to develop machine translation systems that accurately translate the named entities in an input sentence into the target language.
Here, the source language is English and the target language is German (the notebook loads the `de` training split and the `de_DE` validation file).


*  Named entities are entities that are referred to by proper names, such as people, organizations, locations, dates, and more.
* Named entities are often challenging even for human translators, as sometimes there are cultural or domain-specific references that are not easily translatable.

#Dataset:
The dataset we use is sourced from Mintaka and is downloaded from the SemEval EA-MT task site.
For this project we use the training and validation splits.

Install the libraries needed for SemEval Task 2.


#Install required libraries

In [1]:
!pip install unbabel-comet


In [1]:

!pip install transformers datasets sentencepiece evaluate sacrebleu --quiet

#Various imports
pandas: Reads and manipulates tabular data; here it holds the parsed training and validation records.

json: Reads and parses the JSON/JSONL files of the SemEval Task 2 dataset.

torch: PyTorch, used to run deep learning models on CPU or GPU (e.g., torch.cuda.is_available() checks whether a GPU can be used).

datasets.Dataset: Builds a Hugging Face dataset from a pandas DataFrame using Dataset.from_pandas().

transformers modules:
MBartForConditionalGeneration: Loads the pretrained mBART-50 multilingual model for conditional generation tasks like translation.

MBart50TokenizerFast: The fast tokenizer for the mBART-50 model, required for preparing inputs and outputs.

Seq2SeqTrainer: A high-level trainer for sequence-to-sequence models (such as translation or summarization).

Seq2SeqTrainingArguments: Holds hyperparameters and training configuration (batch size, learning rate, epochs, etc.).

DataCollatorForSeq2Seq: Pads input and label sequences to the longest sequence in each batch (dynamic padding).

evaluate: Loads evaluation metrics; this notebook uses it for COMET.




In [2]:
# Data handling
import pandas as pd
import json

# Model & tokenizer handling
import torch
from datasets import Dataset
from transformers import (
    MBartForConditionalGeneration,         # Pretrained multilingual translation model
    MBart50TokenizerFast,                  # Fast tokenizer for mbart50 model
    Seq2SeqTrainer,                        # Trainer class for sequence-to-sequence models
    Seq2SeqTrainingArguments,              # Training args specific for seq2seq
    DataCollatorForSeq2Seq                 # Dynamically pads sequences during batching
)

# Metric loading (used for COMET below)
import evaluate


#Download Data then unzip data

This function downloads a .zip dataset from a given URL, extracts its contents to a folder, and then deletes the .zip archive.

In [3]:
import os
import zipfile

import requests

def download_data_and_prep(url, filename):
  """Download a .zip archive from url, extract it, then delete the archive."""
  response = requests.get(url, timeout=60)
  response.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
  with open(filename, 'wb') as f:
    f.write(response.content)

  print(f"{filename} downloaded successfully.")

  # Extract into a folder named after the archive (e.g. train_data.zip -> train_data)
  extract_folder = os.path.splitext(filename)[0]
  os.makedirs(extract_folder, exist_ok=True)

  with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall(extract_folder)

  print(f"Extracted to: {extract_folder}")


  # Delete the zip file

  os.remove(filename)
  print(f"Deleted archive: {filename}")
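
The extract-then-delete step can be tried offline. The following is a minimal sketch, not the notebook's own function: `extract_and_remove` and the `demo_data.zip` archive are hypothetical names, and the archive is built locally so no download is needed.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_and_remove(zip_path):
    """Extract zip_path into a sibling folder named after the archive, then delete the archive."""
    zip_path = Path(zip_path)
    extract_folder = zip_path.with_suffix("")  # demo_data.zip -> demo_data
    extract_folder.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_folder)
    zip_path.unlink()  # remove the archive once its contents are extracted
    return extract_folder

# Build a tiny archive in a temporary directory so the sketch runs without network access.
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "demo_data.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("train.jsonl", '{"source": "hello"}\n')

folder = extract_and_remove(demo_zip)
print(folder.name, sorted(p.name for p in folder.iterdir()), demo_zip.exists())
```

Deriving the folder name from the file stem avoids relying on the extension being exactly four characters long.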

#Give train and validation data url and file name

In [4]:
train_url = 'https://sapienzanlp.github.io/ea-mt/assets/files/semeval.train.v2-e0d1c28b78c8dd4969d25eea5d3bc9cc.zip'
train_filename = 'train_data.zip'
val_url = 'https://sapienzanlp.github.io/ea-mt/assets/files/semeval.validation.v2-889a1492ba6c3791baa8f4224bc8e685.zip'
val_filename = 'val_data.zip'

#Download train and validation data

In [5]:
download_data_and_prep(train_url,train_filename)
download_data_and_prep(val_url,val_filename)

train_data.zip downloaded successfully.
Extracted to: train_data
Deleted archive: train_data.zip
val_data.zip downloaded successfully.
Extracted to: val_data
Deleted archive: val_data.zip


#Convert json data to pandas dataframe

This function converts a .jsonl file to a pandas DataFrame.

In [6]:
'''Convert json to dataframe'''
def json_to_df(path):
  import json
  import pandas as pd
  jsonl_path = path
  with open(jsonl_path, 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]

  # Convert to DataFrame for inspection
  df = pd.DataFrame(data)
  return df
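
JSONL stores one JSON object per line, which is why the function above calls json.loads on each line. A minimal sketch of the same idea without pandas follows; the `read_jsonl` helper and the two sample records are hypothetical and only illustrate the format. Unlike the function above, it skips blank lines and reports the line number of malformed input.

```python
import io
import json

def read_jsonl(handle):
    """Parse one JSON object per non-empty line of a text handle."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline at end of file
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err
    return records

# Hypothetical records, shaped loosely like the training rows (source/target strings).
sample = io.StringIO(
    '{"id": "a1", "source": "Where is Paris?", "target": "Wo liegt Paris?"}\n'
    '\n'
    '{"id": "a2", "source": "Who wrote Faust?", "target": "Wer schrieb Faust?"}\n'
)
records = read_jsonl(sample)
print(len(records), records[0]["target"])
```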

#Store result in train_df and val_df

In [7]:
# JSONL to DataFrame for both train and validation
train_df = json_to_df("train_data/semeval/train/de/train.jsonl")
val_df = json_to_df("val_data/validation/de_DE.jsonl")


#Explore Validation dataset

In [8]:
val_df['target'] = val_df['targets']
val_df.drop('targets',axis=1, inplace= True)
val_df['target']

Unnamed: 0,target
0,[{'translation': 'Wer spielte die Hauptrolle i...
1,[{'translation': 'Wann wurde Der Maulwurf: Und...
2,[{'translation': 'Was ist das Thema der TV-Ser...
3,[{'translation': 'Wie erreichen Besucher das B...
4,[{'translation': 'Wo befindet sich das Burg Li...
...,...
726,[{'translation': 'Wer spielte die Rolle von Ta...
727,[{'translation': 'Wie viele Staffeln gab es in...
728,[{'translation': 'Wo wurde Djatlow-Pass – Tod ...
729,[{'translation': 'Wer sind die Hauptfiguren in...


In [9]:
print(type(train_df["target"][0]))
print(train_df["target"][0])


<class 'str'>
Wie heißt der siebthöchste Berg Nordamerikas?


#Since the validation set has multiple targets per source, flatten the target column

For the training data, we use spaCy to automatically detect named entities in the English source sentences and wrap them in <entity> tags.
For the validation data, we wrap the known entity mention of the first reference in <entity> tags and pair the tagged source with that reference's translation.

In [10]:
# For train_df: Assume target is a single string already
#flat_train_df = train_df.rename(columns={"source": "input", "target": "target"})
# Wrap Train in entity tags using spacy
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Function to wrap detected entities with <entity> tags
def tag_entities_spacy(text):
    doc = nlp(text)
    for ent in reversed(doc.ents):  # Reverse to avoid offset issues
        text = text[:ent.start_char] + f"<entity>{ent.text}</entity>" + text[ent.end_char:]
    return text

# Apply to training data
train_df["input"] = train_df["source"].apply(tag_entities_spacy)

# train_df["target"] already holds one reference string per row, so it is used as-is


# For val_df: Flatten list of translations (target) per source
def flatten_val_df(df):
    flat_data = []
    for _, row in df.iterrows():
        #for tgt in row["target"]:  # Each entry is a dict with 'mention' and 'translation'
        tgt = row["target"][0] # Take only the first translation
        # Use XML-style tags for the entity
        tagged_input = row["source"].replace(tgt["mention"], f"<entity>{tgt['mention']}</entity>")
        flat_data.append({
            "input": tagged_input,
            "target": tgt["translation"]
        })
    return pd.DataFrame(flat_data)

flat_val_df = flatten_val_df(val_df)
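
The reversed loop in tag_entities_spacy matters because inserting a tag shifts every character offset to its right. The following sketch shows the same right-to-left insertion without spaCy; `tag_spans` is a hypothetical helper, and the character spans are written by hand here, whereas the notebook takes them from `ent.start_char` and `ent.end_char`.

```python
def tag_spans(text, spans, tag="entity"):
    """Wrap each (start, end) character span of text in <tag>...</tag>."""
    # Work from the rightmost span to the leftmost: an insertion only shifts the
    # offsets to its right, so the spans still to be processed stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + f"<{tag}>" + text[start:end] + f"</{tag}>" + text[end:]
    return text

sentence = "Is North America or South America larger by area?"
spans = [(3, 16), (20, 33)]  # hand-written offsets of the two entity mentions
tagged = tag_spans(sentence, spans)
print(tagged)
# Is <entity>North America</entity> or <entity>South America</entity> larger by area?
```

Processing the spans left to right with the original offsets would place the second pair of tags in the wrong position, because the first insertion has already lengthened the string.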


Define the model to use (mBART supports 50+ languages).
Load the tokenizer and model from Hugging Face.

Define the source and target languages.
These must be mBART-50 language codes such as en_XX and de_DE, not bare ISO codes.
Set the tokenizer's source and target languages for encoding text.

In [11]:
# Define the model to use (mBART supports 50+ languages)
model_name = "facebook/mbart-large-50-many-to-many-mmt"

# Load tokenizer and model from Hugging Face
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Define source and target languages.
# These must be mBART-50 language codes (e.g. en_XX, de_DE), not bare ISO codes.
SRC_LANG = "en_XX"
TGT_LANG = "de_DE"  # Change this to the appropriate target language

# Set the tokenizer's source and target languages.
tokenizer.src_lang = SRC_LANG
tokenizer.tgt_lang = TGT_LANG



# Function to tokenize each example

After wrapping entities in <entity> tags and organizing the input/output columns, we tokenize the data for fine-tuning a sequence-to-sequence model (here facebook/mbart-large-50-many-to-many-mmt).

tokenizer.src_lang = "en_XX" sets the source language for the mBART tokenizer.

text_target=... tokenizes the reference in target-language mode, so the label sequence carries the target language code (tokenizer.tgt_lang). It replaces the deprecated tokenizer.as_target_tokenizer() context manager.

padding="max_length" pads every sequence to max_length, which gives fixed-size inputs for batching.

max_length=128 is a configurable limit on sequence length; longer sequences are truncated.

labels are the token ids of the reference translation and are required for supervised training.

In [12]:
# Function to tokenize each example
def tokenize_fn(example):
    tokenizer.src_lang = "en_XX"
    # text_target tokenizes the reference in target-language mode and stores it under "labels"
    model_inputs = tokenizer(
        example["input"],
        text_target=example["target"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )
    return model_inputs

# Convert pandas to Hugging Face dataset and tokenize
train_dataset = Dataset.from_pandas(train_df).map(tokenize_fn)
val_dataset = Dataset.from_pandas(flat_val_df).map(tokenize_fn)
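
One side effect of padding="max_length" is that the pad positions end up inside the labels. PyTorch's cross-entropy loss skips only positions whose label equals -100, so unmasked padding counts toward the loss. A common remedy is to replace pad ids in the labels with -100 before training. The sketch below shows the idea on plain lists; the token ids and the pad id of 1 are made up for illustration (in the notebook the pad id would come from tokenizer.pad_token_id), and both helper names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_pad_labels(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding positions in a label sequence so they do not count toward the loss."""
    return [ignore_index if token == pad_token_id else token for token in label_ids]

def unmask_labels(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Restore pad ids before decoding, since -100 is not a valid token id."""
    return [pad_token_id if token == ignore_index else token for token in label_ids]

# Made-up token ids; the trailing 1s stand in for padding.
example_labels = [250003, 901, 17, 2, 1, 1, 1]
masked = mask_pad_labels(example_labels, pad_token_id=1)
print(masked)  # [250003, 901, 17, 2, -100, -100, -100]
print(unmask_labels(masked, pad_token_id=1) == example_labels)  # True
```

If labels are masked this way, they need to be unmasked before any call to tokenizer.batch_decode, as the second helper does.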



# Evaluate using COMET

COMET is used through Hugging Face's evaluate library, which relies on the unbabel-comet package installed at the top of this notebook.

The cell below loads the COMET metric, a learned neural metric that scores a translation from the source sentence, the system output, and the reference.

In [13]:
from evaluate import load
comet = load("comet")




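This function can be passed to the Seq2SeqTrainer as compute_metrics. It decodes predictions and labels, pairs them with the validation sources, and computes the COMET score. Note that the trainer cell further down does not pass it, so only the validation loss is reported during training: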

In [14]:
from evaluate import load

# Load COMET metric once globally
comet_metric = load("comet")

# Define compute_metrics to use inside Seq2SeqTrainer
def compute_metrics_comet(eval_pred):
    predictions, labels = eval_pred

    # Decode predictions and labels
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Wrap reference labels to match expected format
    references = decoded_labels

    # Get sources from val_dataset (must be defined in global scope)
    sources = [example["input"] for example in val_dataset]

    # COMET expects: source (en), prediction (de), reference (de)
    result = comet_metric.compute(predictions=decoded_preds, references=references, sources=sources)

    # Return in format expected by `metric_for_best_model`
    return {"comet": result["mean_score"]}




# Training Argument for COMET

We fine-tune the facebook/mbart-large-50-many-to-many-mmt model using Hugging Face's Seq2SeqTrainer and evaluate the result with the COMET metric.
We ensure that the decoder knows the target language (important for multilingual models like mBART) by explicitly setting forced_bos_token_id.
This makes the model start decoding with the target language code, so it generates output in the intended language.

In [15]:
from transformers import Seq2SeqTrainingArguments

training_args_comet = Seq2SeqTrainingArguments(
    output_dir="./ea_mt_model",
    per_device_train_batch_size=4,
    num_train_epochs=5,                          # Fixed number of epochs; no early stopping callback is attached
    learning_rate=1e-5,
    logging_dir="./logs",
    eval_strategy="epoch",                 # Evaluate on the validation set after each epoch
    save_strategy="epoch",
    save_total_limit=2,
    predict_with_generate=True,
    report_to="none",                            # Disable wandb
    fp16=torch.cuda.is_available(),
    generation_max_length=128,
    generation_num_beams=4,
)

# Force the decoder to start with the target language code
model.config.forced_bos_token_id = tokenizer.lang_code_to_id[TGT_LANG]


We use Hugging Face's Seq2SeqTrainer to manage the entire training and evaluation loop for our entity-aware machine translation task.

A single call, trainer.train(), runs the training loop for the specified number of epochs, evaluating and saving a checkpoint after each one.



In [16]:
# Trainer will handle training loop, eval, saving, etc.

from transformers import Seq2SeqTrainer, DataCollatorForSeq2Seq

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args_comet,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model),
)
trainer.train()




Epoch,Training Loss,Validation Loss
1,0.0954,0.179101
2,0.0577,0.186777
3,0.0385,0.197286
4,0.0273,0.209128
5,0.0201,0.213881




TrainOutput(global_step=5110, training_loss=0.21489716290027894, metrics={'train_runtime': 915.3466, 'train_samples_per_second': 22.325, 'train_steps_per_second': 5.583, 'total_flos': 5535662100971520.0, 'train_loss': 0.21489716290027894, 'epoch': 5.0})
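
The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is set in the training arguments above); the sizes and losses are copied from the notebook output. The validation loss is lowest after the first epoch and rises afterwards while the training loss keeps falling, which suggests overfitting in terms of loss, although a generation metric such as COMET need not follow the same trend.

```python
import math

train_size = 4087  # training examples, from the Map progress output
batch_size = 4     # per_device_train_batch_size
epochs = 5         # num_train_epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1022 5110, matching global_step=5110

# Validation loss per epoch, copied from the table above.
val_loss = {1: 0.179101, 2: 0.186777, 3: 0.197286, 4: 0.209128, 5: 0.213881}
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch)  # 1
```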

In [17]:
print("Train size:", len(train_df))
print("Val size:", len(flat_val_df))
print("\nSample train:")
print(train_df.sample(3))

print("\nSample val:")
print(flat_val_df.sample(3))

Train size: 4087
Val size: 731

Sample train:
            id source_locale target_locale  \
599   be29d8d9            en            de   
752   01cf3fca            en            de   
3095  9dc75da3            en            de   

                                                 source  \
599   Which book in the Hitchhiker's Guide to the Ga...   
752   Is North America or South America larger by area?   
3095      Which author has sold the most mystery books?   

                                                 target    entities     from  \
599   Welches Buch in der Reihe Per Anhalter durch d...      [Q571]  mintaka   
752   Ist Nordamerika oder Südamerika flächenmäßig g...  [Q49, Q18]  mintaka   
3095  Welcher Schriftsteller hat die meisten Krimis ...    [Q36180]  mintaka   

                                                  input  
599   Which book in the Hitchhiker's Guide to the Ga...  
752   Is <entity>North America</entity> or <entity>S...  
3095      Which author has sold the m

# Translate and Evaluate Validation Dataset with COMET

We evaluate the translation quality of our fine-tuned entity-aware mBART model using the COMET metric, which scores each output from the source, the reference, and the generated translation and is designed to track human judgments.

In [18]:
from evaluate import load

# Step 1: Generate predictions on the validation set
results = trainer.predict(val_dataset)

# Step 2: Decode the predicted tokens and label tokens
decoded_preds = tokenizer.batch_decode(results.predictions, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(results.label_ids, skip_special_tokens=True)

# Step 3: Get the source sentences from the validation dataset
# val_dataset is tokenized, but still contains the original "input"
sources = [example["input"] for example in val_dataset]

# Step 4: Load the COMET metric
comet = load("comet")

# Step 5: Compute COMET score
comet_result = comet.compute(predictions=decoded_preds, references=decoded_labels, sources=sources)

# Step 6: Print result
print("Fine Tuned COMET Score:", comet_result["mean_score"])



Fine Tuned COMET Score: 0.8493510397041545


# Work with Base Model to compare with finetuned model

In [19]:
# Define the model to use (mBART supports 50+ languages)
base_model_name = "facebook/mbart-large-50-many-to-many-mmt"

# Load tokenizer and model from Hugging Face
base_tokenizer = MBart50TokenizerFast.from_pretrained(base_model_name)
base_model = MBartForConditionalGeneration.from_pretrained(base_model_name)

# Define source and target languages.
# These must be mBART-50 language codes (e.g. en_XX, de_DE), not bare ISO codes.
SRC_LANG = "en_XX"
TGT_LANG = "de_DE"  # Change this to the appropriate target language

# Set the tokenizer's source and target languages.
base_tokenizer.src_lang = SRC_LANG
base_tokenizer.tgt_lang = TGT_LANG


In [20]:
import torch
from evaluate import load

# Step 1: Move the pretrained (not fine-tuned) mBART model loaded above to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
base_model = base_model.to(device)
base_model.eval()

# Step 2: Generate translations using the base model
base_predictions = []

for example in val_dataset:
    # Tokenize input and move to device
    inputs = base_tokenizer(
        example["input"],
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=128
    ).to(base_model.device)

    # Generate translation (German target)
    with torch.no_grad():
        output = base_model.generate(
            **inputs,
            forced_bos_token_id=base_tokenizer.lang_code_to_id[TGT_LANG]
        )
    decoded = base_tokenizer.decode(output[0], skip_special_tokens=True)
    base_predictions.append(decoded)


In [21]:
sources = [example["input"] for example in val_dataset]

# Step 4: Load the COMET metric
comet = load("comet")

# Step 5: Compute COMET score
comet_result = comet.compute(predictions=base_predictions, references=decoded_labels, sources=sources)

# Step 6: Print result
print("Base COMET Score:", comet_result["mean_score"])


Base COMET Score: 0.8466410838301954
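
The fine-tuned and base scores differ by only about 0.0027. One way to judge whether such a gap is more than noise is a paired bootstrap over per-segment scores. The sketch below is illustrative only: it uses synthetic scores, not the notebook's results, and it assumes per-segment scores are available for both systems (the COMET result dictionary is assumed here to expose them under a "scores" key). `paired_bootstrap` is a hypothetical helper.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which system A's mean score exceeds system B's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample segment indices with replacement; use the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples

# Synthetic per-segment scores for illustration only.
rng = random.Random(42)
base_scores = [rng.uniform(0.7, 0.95) for _ in range(200)]
tuned_scores = [score + 0.05 for score in base_scores]  # uniformly better by construction

win_rate = paired_bootstrap(tuned_scores, base_scores)
print(win_rate)  # 1.0, because every segment improves in this synthetic case
```

With real scores, a win rate near 0.5 would indicate that the two systems are hard to tell apart on this validation set.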


In [22]:
! pip install seqeval




In [36]:
train_df_acc = json_to_df("train_data/semeval/train/de/train.jsonl")

In [40]:
import spacy
from tqdm import tqdm

# Load spaCy English NER model
nlp = spacy.load("en_core_web_sm")

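def compute_entity_preservation(sources, translations):
    """Count source entities whose English surface form appears verbatim in the translation.

    Correctly translated names (e.g. "United States" -> "Vereinigten Staaten")
    are therefore counted as missed.
    """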
    total_entities = 0
    preserved_entities = 0
    missed_entities = []

    for src, pred in tqdm(zip(sources, translations), total=len(sources)):
        doc = nlp(src)
        ents = [ent.text for ent in doc.ents]
        total_entities += len(ents)

        for ent in ents:
            if ent in pred:
                preserved_entities += 1
            else:
                missed_entities.append((ent, pred))

    # Avoid division by zero
    preservation_rate = preserved_entities / total_entities if total_entities > 0 else 0.0

    return {
        "total_entities": total_entities,
        "preserved_entities": preserved_entities,
        "missed": missed_entities,
        "entity_preservation_rate": round(preservation_rate * 100, 2)
    }
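
Because the function above looks for the English entity string inside the German output, it rewards copying rather than translating. A reference-side check is one alternative: test whether the prediction contains an acceptable target-language name. The sketch below assumes such names are available per example; `acceptable_names` is a hypothetical input, `entity_match_rate` is a hypothetical helper, and the second prediction is made up for illustration.

```python
def entity_match_rate(predictions, acceptable_names):
    """Share of predictions containing at least one acceptable target-language entity name."""
    if len(predictions) != len(acceptable_names):
        raise ValueError("Need one list of acceptable names per prediction")
    if not predictions:
        return 0.0
    hits = 0
    for pred, names in zip(predictions, acceptable_names):
        pred_folded = pred.casefold()  # case-insensitive substring match
        if any(name.casefold() in pred_folded for name in names):
            hits += 1
    return hits / len(predictions)

predictions = [
    "Wie erreichen Sie das Schloss Liebenzell?",
    "Who played the lead in The Mole?",  # made-up untranslated output
]
acceptable_names = [
    ["Burg Liebenzell", "Schloss Liebenzell"],  # hypothetical acceptable German names
    ["Der Maulwurf"],
]
rate = entity_match_rate(predictions, acceptable_names)
print(rate)  # 0.5
```

Under this check the first prediction counts as a hit, whereas a verbatim search for "Liebenzell Castle" would count it as a miss.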



# Accuracy on Base Model

In [42]:
# Assume you're using decoded source and predicted translations
# sources = [ex["input"] for ex in val_dataset]  # tokenized dataset

metrics = compute_entity_preservation(sources, base_predictions)

print(f"\nEntity Preservation Rate: Accuracy : {metrics['entity_preservation_rate']}%")
print(f"Total Entities: {metrics['total_entities']}")
print(f"Preserved: {metrics['preserved_entities']}")
print(f"Examples of missed entity translations: {metrics['missed'][:5]}")


100%|██████████| 731/731 [00:04<00:00, 172.92it/s]


Entity Preservation Rate: Accuracy : 57.2%
Total Entities: 757
Preserved: 433
Examples of missed entity translations: [('Liebenzell Castle', 'Wie erreichen Sie das Schloss Liebenzell?'), ('Liebenzell Castle', 'Wo befindet sich das Schloss Liebenzell?'), ('United States', 'Was sind einige der wichtigsten Zutaten in den Vereinigten Staaten militärische Schokolade?'), ('United States', 'Wie unterscheidet sich militärische Schokolade in den Vereinigten Staaten von gewöhnlicher Schokolade?'), ('United States', 'Wie lange ist die militärische Schokolade in den Nahrungsmitteln der Vereinigten Staaten enthalten?')]





# Accuracy on Fine-Tuned Model

In [43]:
metrics = compute_entity_preservation(sources, decoded_preds)

print(f"\nEntity Preservation Rate: Accuracy : {metrics['entity_preservation_rate']}%")
print(f"Total Entities: {metrics['total_entities']}")
print(f"Preserved: {metrics['preserved_entities']}")
print(f"Examples of missed entity translations: {metrics['missed'][:5]}")

100%|██████████| 731/731 [00:04<00:00, 171.40it/s]


Entity Preservation Rate: Accuracy : 34.74%
Total Entities: 757
Preserved: 263
Examples of missed entity translations: [('North Korea', 'Wer spielte die Hauptrolle in „Die Kuschel – Untercover in Nordkorea“?'), ('North Korea', 'Wann wurde „Die Rabe – Untercover in Nordkorea“ veröffentlicht?'), ('The Mole – Undercover', 'Welches ist das Thema der Fernsehserie Der Mole – Untercover in Nordkorea?'), ('North Korea', 'Welches ist das Thema der Fernsehserie Der Mole – Untercover in Nordkorea?'), ('Liebenzell Castle', 'Wie kann der Besucher zum Schloss Liebenzell gelangen?')]



