# Translation Model with a focus on the languages english, german and spanish
    by Simon Scheer
## Project Description
The project aims to create a fine tuned version of the `facebook/mbart-large-50-many-to-many-mmt` model accessed through hugging face. The goal of this project was to create a fine tuning pipeling, by preparing a datset to be used for fine tuning the model, fine tuning the model, evaluating the model by creating a sample of 500 test translations and then afterwards using the fine tuned model to translate written as well as spoken language in spanish, english and german. The model chosen has 600 million parameters and for time reasons fine tuning all of the parameters would have not worked out therefore the chosen method was freezing most parameters and fine tuning only around 4 Million with the chosen trainings dataset this led still to a fine tuning duration of over 14 hours just for one epoch. After fine tuning the model functions were created to transcribe audio and detect the language spoken or written in the input source then the fine tuned model is used for translating the input and afterwards the output can be seen as written text but can be also listened to as an audio output. 

## Project Content
 1. **Import of python libraries used for the model**

 2. **Fine tuning pipeline**
   
       **2.1. Data preparation**
   
       **2.2. Model Setup**
   
       **2.3. Tokenization**

       **2.4. Training Setup**

       **2.5. Training**

       **2.6. Load Models**

       **2.7. Evaluation**

3. **Implementation and Use of Translation Functions**


## 1. Import of python libraries


In [67]:
import torch
import datasets
import sentencepiece
import transformers
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
from peft import LoraConfig, get_peft_model, PeftModel
from datasets import ClassLabel, load_dataset
from collections import Counter
import pandas as pd
from tqdm import tqdm
import sacrebleu
import copy
import whisper
import sounddevice
from scipy.io.wavfile import write
from langdetect import detect
from TTS.api import TTS
from datetime import datetime
import time
import warnings
import re
warnings.filterwarnings("ignore", category=UserWarning)

print(f"MPS available: {torch.backends.mps.is_available()}")
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

MPS available: True
Using device: mps


In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## 2. Fine tuning pipeline

### 2.1. Data Preparation

For the datasets I decided to use the `opus_books` dataset from [hugging face](https://huggingface.co/datasets/Helsinki-NLP/opus_books) since it provides human translated translations for all language pairs the model was fine tuned on from `german to english`, `germand to spanish` and `english to spanish`. The dataset provides tens of thousands translated chunks to work with. The only downside is some books are written in rather old forms of the language and do not use words that are used in every day life sometimes. Since the dataset provides translation always only in one direction I decided to use also the reversed datasets for `english to german`, `spanish to german` and `spanish to english` so the performance should be well in both directions but it has to be taken into account that the transaltions are not direct word for word transaltion but human translations which might lead to some unexpected results in some cases since the transaltion does not seem correct since the translator chose a suiting translation for that case but not in general.

#### 2.1.1. Loading Dataset from Hugging Face Library

In [9]:
de_en_data = datasets.load_dataset("opus_books", "de-en")
en_es_data = datasets.load_dataset("opus_books", "en-es")
de_es_data = datasets.load_dataset("opus_books", "de-es")

#### 2.1.2 Converting Structure
Since the structure of the model input and the `opus_books` dataset differ a bit the data needs to be convertedinto the correct format. The following function is also used to preprocess the data a little bit excluding links and too long chunks for example.

In [10]:
def convert_to_train(ds, src_key, tgt_key, src_lang, tgt_lang):
    """Convert dataset to training format with filtering"""
    converted_data = []
    
    for i, row in enumerate(ds["train"]):
        if i == 0:  
            continue
        
        source_text = row["translation"].get(src_key, "").strip()
        target_text = row["translation"].get(tgt_key, "").strip()

        if len(source_text) < 2 or len(target_text) < 2:
            continue
        if len(source_text.split()) > 80 or len(target_text.split()) > 80:
            continue
        if "http" in source_text or "http" in target_text:
            continue

        converted_data.append({
            "source": source_text,
            "target": target_text,
            "src_lang": src_lang,
            "tgt_lang": tgt_lang
        })
    
    return converted_data

The created function is also used for creating the reversed dataset mentioned before. And then the datasets are all added together to get a `full_data` dataset which can be further processed.

In [37]:
de_to_en = convert_to_train(de_en_data, "de", "en", "de_DE", "en_XX")
en_to_de = convert_to_train(de_en_data, "en", "de", "en_XX", "de_DE")
en_to_es = convert_to_train(en_es_data, "en", "es", "en_XX", "es_XX")
es_to_en = convert_to_train(en_es_data, "es", "en", "es_XX", "en_XX")
de_to_es = convert_to_train(de_es_data, "de", "es", "de_DE", "es_XX")
es_to_de = convert_to_train(de_es_data, "es", "de", "es_XX", "de_DE")
full_data = (de_to_en + en_to_de + en_to_es + es_to_en + de_to_es + es_to_de)

After the `full_data` dataset is created it has to be converted to a hugging face dataset to be used as input in the model later.

In [38]:
dataset = datasets.Dataset.from_list(full_data)

Since I tied first to use the whole dataset in no certain order and it produced not very good outputs I decided to group the data by task and with task is meant the direction of translation.

In [39]:
def add_task_to_data(data):
    data["task"] = data["src_lang"] + " -> " + data["tgt_lang"]
    return data

dataset_with_tasks = dataset.map(add_task_to_data)

task_names = sorted(set(dataset_with_tasks["task"]))

dataset_with_tasks = dataset_with_tasks.cast_column(
    "task",
    ClassLabel(names=task_names)
)


Map:   0%|          | 0/337018 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/337018 [00:00<?, ? examples/s]

#### 2.1.3 Creating Test and Training Dataset
After creating the dataset with the taks column the dataset will be split into a train and test datasplit where the `test_size` was chosen to be very small since as much as possible data shpuld have been used for fine tuning.

In [40]:
splits = dataset_with_tasks.train_test_split(
    test_size=0.05,
    stratify_by_column="task",
    seed=42
)
train_data = splits["train"]
test_data = splits["test"]

### 2.2. Model Setup

As mentioned in the project description because of missing grpahical processing unit in my own latop the performance of training showed an expected training time of many days since this seemed to be unattainable in the project scope so I decided to use `LoRA` for freezing many parameters so my model can concentrate on some parameters (still around 4 Million) to be fine tuned.

#### 2.2.1. Loading the Model from Hugging Face 
More information on the model can be found on [Hugging Face](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt).

In [91]:
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

#### 2.2.2. Using LoRA to optimize Fine tuning
Since the training time for the chosen model `facebook/mbart-large-50-many-to-many-mmt` was out of scope another model would have been needed but by then the preprcosessing steps were already done and the model seemed to be very suitable for the use case so I decided to do some research of how to optimize the fine tuning and I found the option of using `LoRA` which essentially allows to fine tune only a certain amount of the model which should still lead to improvements in model performance. `LoRA` works by adding lightweight pieces to the original model instead opf training the whole model again it fine tunes just a set of parameters. A really important parameter which needs to be set to false is the `inference_mode` parameter if it is set to true the model actually does not train and save the parameters.

In [92]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    inference_mode=False  
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 4,718,592 || all params: 615,598,080 || trainable%: 0.7665


### 2.3. Tokenization
This part tokenizes the data and sorts it by task again so we keep the correct order.

In [93]:
train_data = train_data.sort("task")
test_data = test_data.sort("task")

In [14]:
def tokenization(data):
    tokenizer.src_lang = data["src_lang"]
    
    model_inputs = tokenizer(
        data["source"], 
        max_length=128, 
        truncation=True, 
        padding="max_length"
    )
    
    tokenizer.tgt_lang = data["tgt_lang"]
    
    labels = tokenizer(
        text_target=data["target"], 
        max_length=128, 
        truncation=True, 
        padding="max_length"
    )
    
    labels_ids = labels["input_ids"]
    labels_ids = [
        (token if token != tokenizer.pad_token_id else -100) 
        for token in labels_ids
    ]
    
    model_inputs["labels"] = labels_ids
    return model_inputs

Implementation of the tokenization function created above:

In [94]:
train_data_tokenized = train_data.map(tokenization, batched=False)
test_data_tokenized = test_data.map(tokenization, batched=False)

Map:   0%|          | 0/320167 [00:00<?, ? examples/s]

Map:   0%|          | 0/16851 [00:00<?, ? examples/s]

Remove the unnecessary columns (the columns containing the not tokenized data):

In [95]:
train_data_tokenized = train_data_tokenized.remove_columns(train_data.column_names)
test_data_tokenized = test_data_tokenized.remove_columns(test_data.column_names)

### 2.4. Training Setup

In [96]:
data_collator = transformers.DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
    pad_to_multiple_of=8
)

training_args = transformers.TrainingArguments(
    output_dir="./mbart_lora_fixed",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_steps=200,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    metric_for_best_model="loss",
    optim="adamw_torch",
    fp16=False,
    bf16=False,
    report_to="none",
    remove_unused_columns=False,
    dataloader_pin_memory=False
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data_tokenized,
    eval_dataset=test_data_tokenized,
    data_collator=data_collator,
    processing_class=tokenizer
)

### 2.5. Training 

In [None]:
trainer.train()

Saving the models created by the trainer

In [None]:
trainer.save_model("mbart50_lora_final_fixed")
tokenizer.save_pretrained("mbart50_lora_final_fixed")
model.save_pretrained("mbart50_lora_final_fixed")

### 2.6. Load Models 
Now the fine tuned model as well as the base model are loaded again so they can be afterwards evaluated, since not the whole model was fine tuned but only some parameters the base model and the adapter have to be combined which leads to the inference model being created.

In [97]:
base_model_comparison = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
base_model_comparison = base_model_comparison.to(device)
base_model_comparison.eval()

base_model_finetuned = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer_inference = MBart50TokenizerFast.from_pretrained("mbart50_lora_final_fixed")
model_inference = PeftModel.from_pretrained(base_model_finetuned, "mbart50_lora_final_fixed")
model_inference = model_inference.to(device)
model_inference.eval()

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): MBartForConditionalGeneration(
      (model): MBartModel(
        (shared): MBartScaledWordEmbedding(250054, 1024, padding_idx=1)
        (encoder): MBartEncoder(
          (embed_tokens): MBartScaledWordEmbedding(250054, 1024, padding_idx=1)
          (embed_positions): MBartLearnedPositionalEmbedding(1026, 1024)
          (layers): ModuleList(
            (0-11): 12 x MBartEncoderLayer(
              (self_attn): MBartAttention(
                (k_proj): lora.Linear(
                  (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1024, out_features=16, bias=False)
                  )
                  (lora_B): ModuleDict(
                    (default): Linear(in_features=16, out_featur

Creation of translation function for comaprison of both models used for creation of dataset used to compare each output.

In [98]:
def translate(text, src="en_XX", tgt="es_XX"):
    tokenizer_inference.src_lang = src
    
    encoded = tokenizer_inference(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    )
    
    #model_inputs = {k: v.to(model_inference.device) for k, v in encoded.items()}
    model_inputs = encoded.to(model_inference.device)
    
    with torch.no_grad():
        generated_tokens = model_inference.generate(
            **model_inputs,
            forced_bos_token_id=tokenizer_inference.lang_code_to_id[tgt],
            max_length=128,
            num_beams=5
        )
    
    return tokenizer_inference.batch_decode(generated_tokens, skip_special_tokens=True)[0]

def translate_base(text, src="en_XX", tgt="es_XX"):
    tokenizer.src_lang = src
    
    encoded = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    )
    
    #model_inputs = {k: v.to(base_model_comparison.device) for k, v in encoded.items()}
    model_inputs = encoded.to(model_inference.device)
    
    with torch.no_grad():
        generated_tokens = base_model_comparison.generate(
            **model_inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id[tgt],
            max_length=128,
            num_beams=5
        )
    
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

### 2.7. Evaluation of fine tuned model cpmpared to base level

In [99]:
def evaluate_translations(data, num_samples=500, save_path="evaluation_results.csv"):
    results = []
    
    sample_data = data.select(range(min(num_samples, len(data))))
    
    for example in tqdm(sample_data):
        base_translation = translate_base(
            example["source"], 
            src=example["src_lang"], 
            tgt=example["tgt_lang"]
        )
        
        fine_tuned_translation = translate(
            example["source"], 
            src=example["src_lang"], 
            tgt=example["tgt_lang"]
        )
        
        results.append({
            "input": example["source"],
            "reference": example["target"],
            "base_translation": base_translation,
            "fine_tuned_translation": fine_tuned_translation,
            "task": f"{example['src_lang']}->{example['tgt_lang']}"
        })
    
    results_df = pd.DataFrame(results)
    results_df.to_csv(save_path, index=False)
    
    return results_df

In [100]:
eval_df = evaluate_translations(test_data, num_samples=500)

100%|█████████████████████████████████████████| 500/500 [16:23<00:00,  1.97s/it]


In [101]:
bleu_base = sacrebleu.corpus_bleu(
    eval_df["base_translation"].tolist(), 
    [eval_df["reference"].tolist()]
)

bleu_fine_tuned = sacrebleu.corpus_bleu(
    eval_df["fine_tuned_translation"].tolist(), 
    [eval_df["reference"].tolist()]
)

print(bleu_base.score, bleu_fine_tuned.score, bleu_fine_tuned.score - bleu_base.score)


18.76701576329957 22.371763468091533 3.6047477047919614


## 3. Second Iteration of fine tuning
After the first iteration of the fine tuning was concluded and I still wanted better results I decided to fine tune the model on a second different dataset to try to improve the model even more. For this reason there will now be a second version of most steps described before already some may be even redundand because they were already used before some new functions. For this time we used the `Amani27/massive_transaltion_dataset` and created a trainingdataset of around 65.000 translation samples. This dataset contains more modern every day translations.

In [4]:
dataset_2 = datasets.load_dataset("Amani27/massive_translation_dataset")

### 3.1. Dataset Preparation
Since the first dataset we used and the second dataset have two different structures we needed a new function to convert the dataset so we can use it as input for the tokenizer later on.

In [5]:
def create_amani_datasets( src_lang, tgt_lang, dataset=dataset_2):

    output_data={}
    
    match src_lang:
        case "en_US":
            lang_in = "en_XX"
        case "de_DE":
            lang_in = src_lang
        case "es_ES":
            lang_in = "es_XX"
            
    match tgt_lang:
        case "en_US":
            lang_out = "en_XX"
        case "de_DE":
            lang_out = tgt_lang
        case "es_ES":
            lang_out = "es_XX"
        
    source_text = dataset["train"][src_lang]
    target_text = dataset["train"][tgt_lang]

    output_data = datasets.Dataset.from_dict({
        "source": list(dataset["train"][src_lang]),
        "target": list(dataset["train"][tgt_lang]),
        "src_lang": [lang_in] * len(dataset["train"]),
        "tgt_lang": [lang_out] * len(dataset["train"]),
    })
    return output_data

    

This function is a bit more efficient thancthe one created earlier since it already outputs a hugging face dataset structure for each function call and the datasets have to be just combined in the next step.

In [6]:
dataset2_en_to_de = create_amani_datasets( src_lang="en_US", tgt_lang="de_DE")
dataset2_de_to_en = create_amani_datasets( src_lang="de_DE", tgt_lang="en_US")
dataset2_en_to_es = create_amani_datasets( src_lang="en_US", tgt_lang="es_ES")
dataset2_de_to_es = create_amani_datasets( src_lang="de_DE", tgt_lang="es_ES")
dataset2_es_to_de = create_amani_datasets( src_lang="es_ES", tgt_lang="de_DE")
dataset2_es_to_en = create_amani_datasets( src_lang="es_ES", tgt_lang="en_US")
dataset2 = datasets.concatenate_datasets([dataset2_en_to_de,dataset2_de_to_en,dataset2_en_to_es,dataset2_de_to_es,dataset2_es_to_de,dataset2_es_to_en])

As mentioned above in the first iteration this code is used to create a column containing the translation direction to later group by this column to have the correct order even after splitting in training and test dataset.

In [15]:
dataset_with_tasks2 = dataset2.map(add_task_to_data)

dataset_with_tasks2 = dataset_with_tasks2.cast_column(
    "task",
    ClassLabel(names=task_names)
)

Map:   0%|          | 0/69084 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/69084 [00:00<?, ? examples/s]

In [16]:
splits2 = dataset_with_tasks2.train_test_split(
    test_size=0.05,
    stratify_by_column="task",
    seed=42
)
train_data2 = splits2["train"]
test_data2 = splits2["test"]

### 3.2. Load first iteration of fine tuned model
After creating the dataset we need to get the tokenizer once again and also load the models and merge the base_model with the fine tuned adapter to work on the merged model for the second iteration of fine tuning.

In [19]:
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
base_model_finetuned2 = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model_inference2 = PeftModel.from_pretrained(base_model_finetuned2, "mbart50_lora_final_fixed")
merged_model2 = model_inference2.merge_and_unload()

In [20]:
lora_config2 = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
    inference_mode=False  
)

model2 = get_peft_model(merged_model2, lora_config2)
model2.print_trainable_parameters()

trainable params: 4,718,592 || all params: 615,598,080 || trainable%: 0.7665


### 3.3. Tokenization Second Iteration
After the model and the tokenizer are loaded the steps of tokenozation which use the same function as a bove can be done. First the dataset is sorted by task then the `tokenization`is used to tokenize all the data and afterwards the string input columns are removed so that only the numerical columns are kept in the end.

In [21]:
train_data2 = train_data2.sort("task")
test_data2 = test_data2.sort("task")

In [22]:
train_data_tokenized2 = train_data2.map(tokenization, batched=False)
test_data_tokenized2 = test_data2.map(tokenization, batched=False)

Map:   0%|          | 0/65629 [00:00<?, ? examples/s]

Map:   0%|          | 0/3455 [00:00<?, ? examples/s]

In [23]:
train_data_tokenized2 = train_data_tokenized2.remove_columns(train_data2.column_names)
test_data_tokenized2 = test_data_tokenized2.remove_columns(test_data2.column_names)

### 3.4. Training Setup for the second Iteration
we used more or less the same training setup but since we have a smaller dataset for the second iteration (1/5 of dataset one) it was decided that a higher number of the parameter `num_train_epochs`will be used (3 instead of 1 in the first iteration)

In [25]:
data_collator2 = transformers.DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model2,
    padding=True,
    pad_to_multiple_of=8
)

training_args2 = transformers.TrainingArguments(
    output_dir="./mbart_lora_fixed_round2",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_steps=200,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    metric_for_best_model="loss",
    optim="adamw_torch",
    fp16=False,
    bf16=False,
    report_to="none",
    remove_unused_columns=False,
    dataloader_pin_memory=False
)

trainer2 = transformers.Trainer(
    model=model2,
    args=training_args2,
    train_dataset=train_data_tokenized2,
    eval_dataset=test_data_tokenized2,
    data_collator=data_collator2,
    processing_class=tokenizer
)

### 3.4. Training

In [28]:
trainer2.train()

Step,Training Loss
200,1.731
400,1.5304
600,1.6477
800,1.6538
1000,1.5789
1200,1.5684
1400,1.5536
1600,1.5151
1800,1.5111
2000,1.4826


TrainOutput(global_step=49224, training_loss=1.0761945511138906, metrics={'train_runtime': 34484.5424, 'train_samples_per_second': 5.709, 'train_steps_per_second': 1.427, 'total_flos': 5.406437630017536e+16, 'train_loss': 1.0761945511138906, 'epoch': 3.0})

### 3.5. Load Models for Evaluation of Second Iteration

In [31]:
base_model_comparison = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
base_model_comparison = base_model_comparison.to(device)
#base_model_comparison.eval()

base_model_finetuned = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer_inference = MBart50TokenizerFast.from_pretrained("mbart50_lora_final_fixed")
model_inference = PeftModel.from_pretrained(base_model_finetuned, "mbart50_lora_final_fixed")
model_inference = model_inference.to(device)
#model_inference.eval()

base_model_fine_tuned2 = merged_model2
tokenizer_inference2 = MBart50TokenizerFast.from_pretrained("mbart_lora_fixed_round2/checkpoint-49224")
model_inference2 = PeftModel.from_pretrained(base_model_fine_tuned2, "mbart_lora_fixed_round2/checkpoint-49224")
model_inference2 = model_inference2.to(device)
#model_inference2.eval()

### 3.6. Evaluation of Second Iteration

In [30]:
def translate(text, src="en_XX", tgt="es_XX"):
    tokenizer_inference.src_lang = src
    
    encoded = tokenizer_inference(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    )
    
    #model_inputs = {k: v.to(model_inference.device) for k, v in encoded.items()}
    model_inputs = encoded.to(model_inference.device)
    
    with torch.no_grad():
        generated_tokens = model_inference.generate(
            **model_inputs,
            forced_bos_token_id=tokenizer_inference.lang_code_to_id[tgt],
            max_length=128,
            num_beams=5
        )
    output=tokenizer_inference.decode(generated_tokens[0], skip_special_tokens=True)
    return output

def translate_base(text, src="en_XX", tgt="es_XX"):
    tokenizer.src_lang = src
    
    encoded = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    )
    
    #model_inputs = {k: v.to(base_model_comparison.device) for k, v in encoded.items()}
    model_inputs = encoded.to(model_inference.device)
    
    with torch.no_grad():
        generated_tokens = base_model_comparison.generate(
            **model_inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id[tgt],
            max_length=128,
            num_beams=5
        )
    output=tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
    return output

def translate2(text, src="en_XX", tgt="es_XX"):
    tokenizer_inference2.src_lang = src
    
    encoded = tokenizer_inference2(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    )
    
    #model_inputs = {k: v.to(model_inference.device) for k, v in encoded.items()}
    model_inputs = encoded.to(model_inference2.device)
    
    with torch.no_grad():
        generated_tokens = model_inference2.generate(
            **model_inputs,
            forced_bos_token_id=tokenizer_inference2.lang_code_to_id[tgt],
            max_length=128,
            num_beams=5
        )
    output=tokenizer_inference2.decode(generated_tokens[0], skip_special_tokens=True)
    return output

In [33]:
def evaluate_translations2(data, num_samples=500, save_path="evaluation_results.csv"):
    results = []
    
    sample_data = data.select(range(min(num_samples, len(data))))
    
    for example in tqdm(sample_data):
        base_translation = translate_base(
            example["source"], 
            src=example["src_lang"], 
            tgt=example["tgt_lang"]
        )
        
        fine_tuned_translation = translate(
            example["source"], 
            src=example["src_lang"], 
            tgt=example["tgt_lang"]
        )

        fine_tuned_translation2 = translate2(
            example["source"], 
            src=example["src_lang"], 
            tgt=example["tgt_lang"]
        )
        
        results.append({
            "input": example["source"],
            "reference": example["target"],
            "base_translation": base_translation,
            "fine_tuned_translation": fine_tuned_translation,
            "fine_tuned_translation2": fine_tuned_translation2,
            "task": f"{example['src_lang']}->{example['tgt_lang']}"
        })
    
    results_df = pd.DataFrame(results)
    results_df.to_csv(save_path, index=False)
    
    return results_df

In [34]:
eval_df2 = evaluate_translations2(test_data2, num_samples=500)

100%|█████████████████████████████████████████| 500/500 [08:29<00:00,  1.02s/it]


In [36]:
bleu_base = sacrebleu.corpus_bleu(
    eval_df2["base_translation"].tolist(), 
    [eval_df2["reference"].tolist()]
)

bleu_fine_tuned = sacrebleu.corpus_bleu(
    eval_df2["fine_tuned_translation"].tolist(), 
    [eval_df2["reference"].tolist()]
)

bleu_fine_tuned2 = sacrebleu.corpus_bleu(
    eval_df2["fine_tuned_translation2"].tolist(), 
    [eval_df2["reference"].tolist()]
)

print(bleu_base.score, bleu_fine_tuned.score, bleu_fine_tuned2.score, bleu_fine_tuned.score - bleu_base.score, bleu_fine_tuned2.score - bleu_base.score, bleu_fine_tuned2.score - bleu_fine_tuned.score)

34.1801531344744 23.420922631517467 51.92866366784773 -10.759230502956935 17.748510533373327 28.507741036330263


In [41]:
eval_df3 = evaluate_translations2(test_data, num_samples=500)

100%|█████████████████████████████████████████| 500/500 [24:38<00:00,  2.96s/it]


In [42]:
bleu_base = sacrebleu.corpus_bleu(
    eval_df3["base_translation"].tolist(), 
    [eval_df3["reference"].tolist()]
)

bleu_fine_tuned = sacrebleu.corpus_bleu(
    eval_df3["fine_tuned_translation"].tolist(), 
    [eval_df3["reference"].tolist()]
)

bleu_fine_tuned2 = sacrebleu.corpus_bleu(
    eval_df3["fine_tuned_translation2"].tolist(), 
    [eval_df3["reference"].tolist()]
)

print(bleu_base.score, bleu_fine_tuned.score, bleu_fine_tuned2.score, bleu_fine_tuned.score - bleu_base.score, bleu_fine_tuned2.score - bleu_base.score, bleu_fine_tuned2.score - bleu_fine_tuned.score)

13.798896112501707 19.8223741767145 11.604432934303517 6.023478064212794 -2.1944631781981894 -8.217941242410983


## 4. Implementation and Use Case of Translation Functions
As mentioned in the beginning of the Report the use case for this project would have been to create a fine tuned mersion of a model that can be used for transalting text as well as audio input and is trained with a focus on `English`, `German` and `Spanish`. To implement this Use Case another translation function is needed which can be used to translate also audios and can detect the language by itself.

In [78]:
#used for transcription
whisper_model = whisper.load_model("small")  

def record_audio(duration=5, filename="recording", samplerate=44100):
    print("Recording...")
    audio = sounddevice.rec(int(duration * samplerate), samplerate=samplerate, channels=1, device=1)
    print("Time left:")
    for remaining in range(duration, 0, -1):
        print(f"\r{remaining} ...", end="", flush=True)
        time.sleep(1)
    
    sounddevice.wait()
    now=datetime.now()
    filename = filename + now.strftime("%m_%d_%Y_%H_%M_%S")+".wav"
    write(filename, samplerate, audio)
    return filename

def transcribe_audio(path, model=whisper_model):
    result = model.transcribe(path)
    text = result["text"]
    return text

tts_en = TTS("tts_models/en/ljspeech/tacotron2-DDC")
tts_de = TTS("tts_models/de/thorsten/vits")
tts_es = TTS("tts_models/es/css10/vits")

def play_translation_audio(translation, tgt_in):
    match tgt_in:
        case "en":
            tts = tts_en
        case "de":
            tts = tts_de
        case "es":
            tts = tts_es
            
    audio = tts.tts(translation)

    sounddevice.play(audio, samplerate=tts.synthesizer.output_sample_rate, device=2)
    sounddevice.wait()

def translate_with_lang_detect2(tgt_in, text_in=None, audio_file=None, audio_input=0, audio_output=False):
    translations = []
    if text_in is not None or audio_file is not None or audio_input > 0:
        if audio_input > 0:

            audio_file = record_audio(audio_input)
        
        if audio_file is not None or audio_input > 0:

            text_in=transcribe_audio(audio_file)
            print(text_in)
        
        #detect language from written text with detect function from library
        src = detect(text_in)
    
        match src:
            case "en":
                tokenizer_inference.src_lang = "en_XX"
            case "de":
                tokenizer_inference.src_lang = "de_DE"
            case "es": 
                tokenizer_inference.src_lang = "es_XX"
    
        match tgt_in:
            case "en":
                tgt = "en_XX"
            case "de":
                tgt = "de_DE"
            case "es": 
                tgt = "es_XX"

        text_parts= re.split(r"(?<=[.!?])\s+", text_in)

        for text in text_parts:
        
            encoded = tokenizer_inference2(
                text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            )
            
            model_inputs = encoded.to(model_inference2.device)
            
            with torch.no_grad():
                generated_tokens = model_inference2.generate(
                    **model_inputs,
                    forced_bos_token_id=tokenizer_inference2.lang_code_to_id[tgt],
                    max_length=128,
                    num_beams=5, 
                    #early_stopping=True,
                    #no_repeat_ngram_size=3,  
                    #repetition_penalty=1.2
                )
                
            translation=tokenizer_inference2.decode(generated_tokens[0], skip_special_tokens=True)
            translations.append(translation)

        output = " ".join(translations)
        
        
        if audio_output is True:
            print("yes")
            play_translation_audio(output, tgt_in)
            
        return output
    else:
        return("Invalid input at least one of the three has to be set: text, audio_file or audio_input")


def translate_with_lang_detect(tgt_in, text_in=None, audio_file=None, audio_input=0, audio_output=False):
    translations = []
    if text_in is not None or audio_file is not None or audio_input > 0:
        if audio_input > 0:

            audio_file = record_audio(audio_input)
        
        if audio_file is not None or audio_input > 0:

            text_in=transcribe_audio(audio_file)
            print(text_in)
        
        #detect language from written text with detect function from library
        src = detect(text_in)
    
        match src:
            case "en":
                tokenizer_inference.src_lang = "en_XX"
            case "de":
                tokenizer_inference.src_lang = "de_DE"
            case "es": 
                tokenizer_inference.src_lang = "es_XX"
    
        match tgt_in:
            case "en":
                tgt = "en_XX"
            case "de":
                tgt = "de_DE"
            case "es": 
                tgt = "es_XX"

        text_parts= re.split(r"(?<=[.!?])\s+", text_in)

        for text in text_parts:
        
            encoded = tokenizer_inference(
                text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            )
            
            model_inputs = encoded.to(model_inference.device)
            
            with torch.no_grad():
                generated_tokens = model_inference.generate(
                    **model_inputs,
                    forced_bos_token_id=tokenizer_inference.lang_code_to_id[tgt],
                    max_length=128,
                    num_beams=5, 
                    #early_stopping=True,
                    #no_repeat_ngram_size=3,  
                    #repetition_penalty=1.2
                )
                
            translation=tokenizer_inference.decode(generated_tokens[0], skip_special_tokens=True)
            translations.append(translation)

        output = " ".join(translations)
        
        
        if audio_output is True:
            print("yes")
            play_translation_audio(output, tgt_in)
            
        return output
    else:
        return("Invalid input at least one of the three has to be set: text, audio_file or audio_input")

 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Setting up Audio P

In [109]:
def translate_with_lang_detect3(tgt_in, text_in=None, audio_file=None, audio_input=0, audio_output=False):
    translations = []
    if text_in is not None or audio_file is not None or audio_input > 0:
        if audio_input > 0:

            audio_file = record_audio(audio_input)
        
        if audio_file is not None or audio_input > 0:

            text_in=transcribe_audio(audio_file)
            print(text_in)
        
        #detect language from written text with detect function from library
        src = detect(text_in)
    
        match src:
            case "en":
                tokenizer.src_lang = "en_XX"
            case "de":
                tokenizer.src_lang = "de_DE"
            case "es": 
                tokenizer.src_lang = "es_XX"
    
        match tgt_in:
            case "en":
                tgt = "en_XX"
            case "de":
                tgt = "de_DE"
            case "es": 
                tgt = "es_XX"

        text_parts= re.split(r"(?<=[.!?])\s+", text_in)

        for text in text_parts:
        
            encoded = tokenizer(
                text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            )
            
            model_inputs = encoded.to(model_inference2.device)
            
            with torch.no_grad():
                generated_tokens = model_inference2.generate(
                    **model_inputs,
                    forced_bos_token_id=tokenizer.lang_code_to_id[tgt],
                    max_length=128,
                    num_beams=5, 
                    #early_stopping=True,
                    #no_repeat_ngram_size=3,  
                    #repetition_penalty=1.2
                )
                
            translation=tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
            translations.append(translation)

        output = " ".join(translations)
        
        
        if audio_output is True:
            print("yes")
            play_translation_audio(output, tgt_in)
            
        return output
    else:
        return("Invalid input at least one of the three has to be set: text, audio_file or audio_input")


def translate_with_lang_detect4(tgt_in, text_in=None, audio_file=None, audio_input=0, audio_output=False):
    translations = []
    if text_in is not None or audio_file is not None or audio_input > 0:
        if audio_input > 0:

            audio_file = record_audio(audio_input)
        
        if audio_file is not None or audio_input > 0:

            text_in=transcribe_audio(audio_file)
            print(text_in)
        
        #detect language from written text with detect function from library
        src = detect(text_in)
    
        match src:
            case "en":
                tokenizer.src_lang = "en_XX"
            case "de":
                tokenizer.src_lang = "de_DE"
            case "es": 
                tokenizer.src_lang = "es_XX"
    
        match tgt_in:
            case "en":
                tgt = "en_XX"
            case "de":
                tgt = "de_DE"
            case "es": 
                tgt = "es_XX"

        text_parts= re.split(r"(?<=[.!?])\s+", text_in)

        for text in text_parts:
        
            encoded = tokenizer(
                text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            )
            
            model_inputs = encoded.to(model_inference.device)
            
            with torch.no_grad():
                generated_tokens = model_inference.generate(
                    **model_inputs,
                    forced_bos_token_id=tokenizer.lang_code_to_id[tgt],
                    max_length=128,
                    num_beams=5, 
                    #early_stopping=True,
                    #no_repeat_ngram_size=3,  
                    #repetition_penalty=1.2
                )
                
            translation=tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
            translations.append(translation)

        output = " ".join(translations)
        
        
        if audio_output is True:
            print("yes")
            play_translation_audio(output, tgt_in)
            
        return output
    else:
        return("Invalid input at least one of the three has to be set: text, audio_file or audio_input")


### Testing the functions

In [108]:
translate_with_lang_detect2(text_in="Hello, my name is Simon and i am an astronaut and i really love skiing in the mountains of switzerland and to drink afterwards a good hot chocolate.", tgt_in="de")

'hallo mein name ist Simon und ich bin ein astronaut und ich liebe es in den schwitzischen berg zu skien und danach eine gute heiße chocolate zu trinken'

In [107]:
translate_with_lang_detect3(text_in="Hello, my name is Simon and i am an astronaut and i really love skiing in the mountains of switzerland and to drink afterwards a good hot chocolate.", tgt_in="de")

'hallo mein name ist Simon und ich bin ein astronaut und ich liebe es in den schwitzischen Bergen zu skien und danach eine gute heiße chokolade zu trinken'

In [86]:
translate_with_lang_detect2(tgt_in="de")

'Invalid input at least one of the three has to be set: text, audio_file or audio_input'

In [88]:
translate_with_lang_detect2(audio_file="recording11_23_2025_01_02_17.wav",tgt_in="de", audio_output=True)

 Hello hello how are you doing? I'm fine thank you how are you?
yes
 > Text splitted to sentences.
['hallo hallo wie gehts dir ich bin in Ordnung danke wie gehts dir']
 > Processing time: 0.9087660312652588
 > Real-time factor: 0.20433891121511416


'hallo hallo wie gehts dir ich bin in Ordnung danke wie gehts dir'

In [89]:
translate_with_lang_detect2(text_in="Hello hello how are you doing? I'm fine thank you how are you?",tgt_in="es", audio_output=True)

yes
 > Text splitted to sentences.
['hola hola como estás estoy bien gracias como estas']
 > Processing time: 0.1888437271118164
 > Real-time factor: 0.050348280407423486


'hola hola como estás estoy bien gracias como estas'

In [105]:
translate_with_lang_detect3(text_in="Hello hello how are you doing? I'm fine thank you how are you?",tgt_in="es", audio_output=True)

yes
 > Text splitted to sentences.
['hola como te va estoy bien gracias como estas']
 > Processing time: 0.1624300479888916
 > Real-time factor: 0.0460121089177166


'hola como te va estoy bien gracias como estas'

In [112]:
translate_with_lang_detect(text_in="Hello hello how are you doing? I'm fine thank you how are you?",tgt_in="de", audio_output=True)

yes
 > Text splitted to sentences.
['Hallo, hello, wie bist du dabei?', "I'm fine thank you how are you?"]
 > Processing time: 1.6352949142456055
 > Real-time factor: 0.2677168928123931


"Hallo, hello, wie bist du dabei? I'm fine thank you how are you?"

In [111]:
translate_with_lang_detect4(text_in="Hello hello how are you doing? I'm fine thank you how are you?",tgt_in="de", audio_output=False)

'Hallo, Hallo, wie geht es dir? Ich bin in Ordnung. Danke, wie geht es dir?'

In [114]:
translate_with_lang_detect4(text_in='Hallo, Hallo, wie geht es dir? Ich bin in Ordnung. Danke, wie geht es dir?'
,tgt_in="en", audio_output=False)

'Hello, how are you? I am all right. Thank you, how are you?'

In [115]:
translate_with_lang_detect3(text_in='Hallo, Hallo, wie geht es dir? Ich bin in Ordnung. Danke, wie geht es dir?'
,tgt_in="en", audio_output=False)

'hello, hello how are you i am fine thank you how are you'

In [117]:
translate_with_lang_detect3(text_in='hello, hello how are you i am fine thank you how are you',tgt_in="de", audio_output=False)

'hallo hallo wie gehts dir ich bin gut danke wie gehts dir'

In [118]:
translate_with_lang_detect3(text_in='Hallo, Hallo, wie geht es dir? Mir geht es gut. Danke, wie geht es dir?'
,tgt_in="en", audio_output=False)

'hello, hello how are you i am fine thank you how are you'

In [90]:
#print(sounddevice.query_devices())

## 5. Conclusion
As can be seen both iterations show very different outputs while the first iteration was good at translating longer passages the second iteration needed some help by splitting longer text into short since it was not trained on text containing punctuation for a third iteration it would be great to use a dataset containing correct syntax as well as modern text. Due to the time constraint of each fine tuning iteration taking around 10 to 15 hours and issues in the first few iterations there was simply no time for a third iteration. The second iteration is definitely an improvement in modern speech but sadly has no punctuation and does not recognize which words should be capitalized and which not. 