## Inference Code using t_5_small_model English to German

In [9]:
import pandas as pd
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load pre-trained T5 model and tokenizer
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Function to translate a single English sentence to German
def translate_sentence(english_sentence):
    try:
        print("Input English sentence:", english_sentence)
        # Tokenize input text
        inputs = tokenizer.encode("translate English to German: " + english_sentence, return_tensors="pt", max_length=512, truncation=True)
        
        # Generate translation
        with torch.no_grad():
            outputs = model.generate(inputs, max_length=100, num_beams=4, early_stopping=True)
        
        # Decode the generated translation
        translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print("Translated German sentence:", translated_sentence)
        return translated_sentence
    except Exception as e:
        print("Error during translation:", e)
        return ""

# Function to translate English sentences in a CSV file and output translated CSV
def translate_csv(input_csv_path, output_csv_path):
    try:
        # Read input CSV
        df = pd.read_csv(input_csv_path)
        
        # Translate English sentences
        df['de'] = df['en'].apply(translate_sentence)  # Update column names
        
        # Save translated CSV
        df.to_csv(output_csv_path, index=False)
        
        print(f"Translation completed. Translated CSV saved to {output_csv_path}.")
    except Exception as e:
        print("Error during translation:", e)

# Run translation pipeline for CSV input and output
input_csv_path = "/test_data.csv"
output_csv_path = "output.csv"  # Update output file name
translate_csv(input_csv_path, output_csv_path)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Input English sentence: Obama receives Netanyahu
Translated German sentence: Obama erhält Netanjahu
Input English sentence: The relationship between Obama and Netanyahu is not exactly friendly.
Translated German sentence: Die Beziehung zwischen Obama und Netanjahu ist nicht gerade freundlich.
Input English sentence: The two wanted to talk about the implementation of the international agreement and about Teheran's destabilising activities in the Middle East.
Translated German sentence: Die beiden wollten über die Umsetzung des internationalen Abkommens und über Teherans destabilisierende Aktivitäten im Nahen Osten sprechen.
Input English sentence: The meeting was also planned to cover the conflict with the Palestinians and the disputed two state solution.
Translated German sentence: Das Treffen sollte auch den Konflikt mit den Palästinensern und die umstrittene zweistaatliche Lösung abdecken.
Input English sentence: Relations between Obama and Netanyahu have been strained for years.
Tra

## From English to German

In [7]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset
from bert_score import score
from sacrebleu import corpus_bleu
import os
import nltk
from nltk.translate import meteor_score

# Download necessary NLTK data
nltk.download("wordnet")

# Set CUDA_VISIBLE_DEVICES environment variable
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Load pre-trained T5 model and tokenizer
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load the training dataset (50,000 examples)
train_data = load_dataset("wmt16", "de-en", split="train[:50000]")

# Load the validation dataset
val_data = load_dataset("wmt16", "de-en", split="validation")

# Load the test dataset
test_data = load_dataset("wmt16", "de-en", split="test")


[nltk_data] Downloading package wordnet to /home/seraj/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
import pandas as pd
# Extract German and English translations
translations_val_df = [{"de": item["translation"]["de"], "en": item["translation"]["en"]} for item in val_data]

# Convert translations to Pandas DataFrame
val_df = pd.DataFrame(translations_val_df)

# Display the DataFrame
val_df

Unnamed: 0,de,en
0,Die Premierminister Indiens und Japans trafen ...,India and Japan prime ministers meet in Tokyo
1,Indiens neuer Premierminister Narendra Modi tr...,"India's new prime minister, Narendra Modi, is ..."
2,Herr Modi befindet sich auf einer fünftägigen ...,Mr Modi is on a five-day trip to Japan to stre...
3,Pläne für eine stärkere kerntechnische Zusamme...,High on the agenda are plans for greater nucle...
4,Berichten zufolge hofft Indien darüber hinaus ...,India is also reportedly hoping for a deal on ...
...,...,...
2164,Die Wanderer zuerst um 9.30 Uhr.,The walkers started at 9.30 am.
2165,Es folgten die ersten Radfahrer und Läufer um ...,Then it was the turn of the cyclists and runne...
2166,Fünf Minuten später legten die ersten Mountain...,Five minutes later the first Mountain-bikers s...
2167,"Bent Hansen, Vorsitzender des Vereins ""Radeln ...","Bent Hansen, Chairman of the Association 'Cycl..."


In [10]:
# Extract German and English translations
translations_test_data = [{"de": item["translation"]["de"], "en": item["translation"]["en"]} for item in test_data]

# Convert translations to Pandas DataFrame
test_df = pd.DataFrame(translations_test_data)

# Display the DataFrame
test_df

Unnamed: 0,de,en
0,Obama empfängt Netanyahu,Obama receives Netanyahu
1,Das Verhältnis zwischen Obama und Netanyahu is...,The relationship between Obama and Netanyahu i...
2,Die beiden wollten über die Umsetzung der inte...,The two wanted to talk about the implementatio...
3,Bei der Begegnung soll es aber auch um den Kon...,The meeting was also planned to cover the conf...
4,Das Verhältnis zwischen Obama und Netanyahu is...,Relations between Obama and Netanyahu have bee...
...,...,...
2994,Quecksilber gelangt vor allem durch die Kohlev...,Mercury is released into the environment prima...
2995,Die deutschen Kohlekraftwerke stoßen laut eine...,"German coal plants, according to written infor..."
2996,Die Konzentration von Quecksilber in Fischen e...,"The concentration of mercury in fish, for exam..."
2997,Im vergangenen Jahr zählten europaweite Warnun...,"In the past year, Europe-wide alerts on mercur..."


In [15]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_metric
from nltk.translate.bleu_score import corpus_bleu
import pandas as pd
from bert_score import BERTScorer

# Load evaluation metrics
meteor_score = load_metric("meteor")
scorer = BERTScorer(lang="de", rescale_with_baseline=True)

# Function to translate sentences from English to German
def translate_sentences(sentences):
    translations = []
    for sentence in sentences:
        # Prepend the prefix to the input sentence for English to German translation
        input_text = "translate English to German: " + sentence["en"]
        # Tokenize input text
        input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True).to(device)
        # Generate translation
        with torch.no_grad():
            outputs = model.generate(input_ids=input_ids, max_length=100, num_beams=4, early_stopping=True)
        # Decode the generated translation
        translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
        translations.append(translation)
    return translations

# Function to compute METEOR score for German to English translation
def compute_meteor(references, predictions):
    return meteor_score.compute(predictions=predictions, references=references)["meteor"]

# Function to compute BERTScore for German to English translation
def compute_bertscore(references, predictions):
    P, R, F1 = scorer.score(references, predictions)
    return F1.mean().item()

# Function to compute BLEU score for specified n-gram order
def compute_bleu_ngram(references, predictions, ngram_order):
    return corpus_bleu([[ref] for ref in references], predictions, weights=(1/ngram_order,)*ngram_order)

# Translate validation and test sets
val_translations = translate_sentences(val_df.to_dict("records"))
test_translations = translate_sentences(test_df.to_dict("records"))

# Extract reference translations from validation and test sets
val_references = val_df["de"].tolist()
test_references = test_df["de"].tolist()

# Compute evaluation metrics
val_meteor = compute_meteor(val_references, val_translations)
val_bertscore = compute_bertscore(val_references, val_translations)
test_meteor = compute_meteor(test_references, test_translations)
test_bertscore = compute_bertscore(test_references, test_translations)


# Compute BLEU scores for validation and test data
val_bleu1 = compute_bleu_ngram(val_translations, val_references, 1)
val_bleu2 = compute_bleu_ngram(val_translations, val_references, 2)
val_bleu3 = compute_bleu_ngram(val_translations, val_references, 3)
val_bleu4 = compute_bleu_ngram(val_translations, val_references, 4)

test_bleu1 = compute_bleu_ngram(test_translations, test_references, 1)
test_bleu2 = compute_bleu_ngram(test_translations, test_references, 2)
test_bleu3 = compute_bleu_ngram(test_translations, test_references, 3)
test_bleu4 = compute_bleu_ngram(test_translations, test_references, 4)


# Print evaluation metrics
print("Validation METEOR:", val_meteor)
print("Validation BERTScore:", val_bertscore)
print("Validation BLEU-1:", val_bleu1)
print("Validation BLEU-2:", val_bleu2)
print("Validation BLEU-3:", val_bleu3)
print("Validation BLEU-4:", val_bleu4)

print("Test METEOR:", test_meteor)
print("Test BERTScore:", test_bertscore)
print("Test BLEU-1:", test_bleu1)
print("Test BLEU-2:", test_bleu2)
print("Test BLEU-3:", test_bleu3)
print("Test BLEU-4:", test_bleu4)


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
[nltk_data] Downloading package wordnet to /home/seraj/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/seraj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/seraj/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Validation METEOR: 0.5515488896080301
Validation BERTScore: 0.6489973664283752
Validation BLEU-1: 0.8645239244176106
Validation BLEU-2: 0.785705432461213
Validation BLEU-3: 0.7205451711521605
Validation BLEU-4: 0.6694257941058802
Test METEOR: 0.5902469429945494
Test BERTScore: 0.6776204705238342
Test BLEU-1: 0.871053573206133
Test BLEU-2: 0.8009281409830009
Test BLEU-3: 0.742822494487264
Test BLEU-4: 0.6966520372661966
