# English-to-German translation using the WMT16 dataset and the MarianMT transformer

### Loading the dataset

The dataset is downloaded directly from the Hugging Face Hub using the 'datasets' library.
Once downloaded, the dataset is kept in the local Hugging Face cache on disk, so later calls to 'load_dataset' reuse the cached copy instead of downloading it again.

In [1]:
# Download the German-English pair of the WMT16 dataset
from datasets import load_dataset, load_metric
raw_data = load_dataset("wmt16", "de-en")
raw_data

Reusing dataset wmt16 (C:\Users\Sharanya Manohar\.cache\huggingface\datasets\wmt16\de-en\1.0.0\9e0038fe4cc117bd474d2774032cc133e355146ed0a47021b2040ca9db4645c0)



DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 4548885
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2169
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2999
    })
})
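Each split exposes a single 'translation' feature, which holds one sentence pair keyed by language code. The sketch below mimics that layout with a hand-written record to show how the source and target text are pulled out. The sentence pair is invented for illustration and is not taken from WMT16.

```python
# Hypothetical record that mimics the layout of one wmt16 "de-en" row.
# The sentence pair is made up for illustration, not copied from the dataset.
example_row = {
    "translation": {
        "de": "Das ist ein Beispiel.",
        "en": "This is an example.",
    }
}

source_text = example_row["translation"]["en"]  # English is the source language here
target_text = example_row["translation"]["de"]  # German is the target language

print(source_text)
print(target_text)
```

With the real dataset, the equivalent lookup would be `raw_data["train"][0]["translation"]`.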

### Defining the tokenizer

The tokenizer that ships with the pre-trained "opus-mt-en-de" model from Helsinki-NLP is used to tokenize the text in the dataset.

In [2]:
from transformers import AutoTokenizer

# Name of the pre-trained checkpoint on the Hugging Face Hub
model_marianMT = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_marianMT, use_fast=False)

In [3]:
# Unlike T5, MarianMT does not require a task prefix
prefix = ""

### Defining the pre-processing function

In [4]:
# See the Hugging Face documentation for the supported language codes
source_language = "en"
target_language = "de"

max_input_length = 128
max_target_length = 128

def preprocess(instances):
    # Each entry of instances["translation"] is a dict keyed by language code
    inputs = [prefix + pair[source_language] for pair in instances["translation"]]
    targets = [pair[target_language] for pair in instances["translation"]]
    tokenized_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Switch the tokenizer to target-language mode before encoding the labels
    # (newer transformers releases deprecate this in favor of the text_target argument)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    tokenized_inputs["labels"] = labels["input_ids"]
    return tokenized_inputs
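To make the data flow of the pre-processing function easier to follow, here is a minimal sketch that runs without downloading a model. It swaps the real tokenizer for a whitespace splitter; `toy_tokenize` and `toy_preprocess` are hypothetical helpers written for this illustration and are not part of transformers. The point is only to show the batched input layout, the truncation to a maximum length, and how the target tokens end up under the "labels" key.

```python
# Stand-in for the real tokenizer: splits on whitespace and truncates.
# `toy_tokenize` is a hypothetical helper, not a transformers function.
def toy_tokenize(texts, max_length):
    return {"input_ids": [text.split()[:max_length] for text in texts]}


def toy_preprocess(batch, source="en", target="de", max_length=4):
    # With batched=True, batch["translation"] is a list of dicts, one per row.
    inputs = [pair[source] for pair in batch["translation"]]
    targets = [pair[target] for pair in batch["translation"]]
    model_inputs = toy_tokenize(inputs, max_length)
    # The tokenized targets are stored under "labels", as in preprocess above.
    model_inputs["labels"] = toy_tokenize(targets, max_length)["input_ids"]
    return model_inputs


# Made-up batch of two rows; single letters stand in for words.
batch = {
    "translation": [
        {"en": "a b c d e f", "de": "u v w"},
        {"en": "g h", "de": "x y z q r"},
    ]
}

out = toy_preprocess(batch)
print(out["input_ids"])
print(out["labels"])
```

Sequences longer than `max_length` are cut off, while shorter ones are left as they are; padding is deferred to the data collator.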

In [5]:
# Apply the pre-processing function to every split of the dataset, in batches
tokenized_datasets = raw_data.map(preprocess, batched=True)


### Creating subsets of the dataset for faster training

In [6]:
# Keep 1,000 shuffled examples per split; evaluation draws from the "test" split
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
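The shuffle-then-select pattern amounts to drawing a fixed-size random sample that is reproducible because of the seed. The following sketch shows the idea with the standard library only. `sample_subset` is a hypothetical helper, and because the 'datasets' library uses its own random number generator, the indices it picks will differ from the ones produced here.

```python
import random


def sample_subset(num_rows, subset_size, seed=42):
    """Return `subset_size` distinct row indices, reproducible for a given seed."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle, independent of global state
    return indices[:subset_size]


# 2999 is the size of the test split reported above.
first = sample_subset(2999, 1000)
second = sample_subset(2999, 1000)

print(len(first))
print(first == second)
```

Fixing the seed means the same 1,000 rows are selected on every run, which keeps the reported scores comparable between runs.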

### Using the pre-trained MarianMT model

In [7]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(model_marianMT)

In [8]:
batch_size = 16

# Define the training arguments
args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,  # keep at most three checkpoints
    num_train_epochs=1,
    predict_with_generate=True,  # generate translations so text metrics can be computed
)
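The number of optimization steps follows directly from these settings. The arithmetic below is a small sketch that assumes a single device and uses the 1,000-example training subset; the gradient accumulation value of 1 is the one reported in the training log.

```python
import math

num_examples = 1000  # size of the training subset
batch_size = 16      # per_device_train_batch_size
num_epochs = 1       # num_train_epochs
grad_accum = 1       # gradient accumulation steps reported in the training log

# The last batch is smaller than 16 but still counts as one step.
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
total_steps = steps_per_epoch * num_epochs

print(total_steps)  # 63
```

This matches the "Total optimization steps = 63" line in the training log. It is also the most likely reason the Training Loss column shows "No log": the arguments above do not set `logging_steps`, and the library default of 500 is larger than the 63 steps that are run.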

In [9]:
# Pad inputs and labels dynamically to the longest sequence in each batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
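Conceptually, the collator pads every sequence in a batch to the length of the longest one, and pads the labels with -100 so that the loss ignores those positions. The sketch below reproduces that behavior in plain Python. `pad_batch` is a hypothetical helper, the token ids are invented, and `PAD_ID = 0` is an assumption for illustration (the real value comes from `tokenizer.pad_token_id`).

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


PAD_ID = 0           # assumed pad token id, for illustration only
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss

# Made-up token ids for a batch of two examples.
input_ids = [[5, 6, 7], [8, 9]]
labels = [[11, 12], [13, 14, 15, 16]]

padded_inputs = pad_batch(input_ids, PAD_ID)
padded_labels = pad_batch(labels, LABEL_PAD_ID)
# 1 marks a real token, 0 marks padding.
attention_mask = [
    [1] * len(seq) + [0] * (len(padded) - len(seq))
    for seq, padded in zip(input_ids, padded_inputs)
]

print(padded_inputs)   # [[5, 6, 7], [8, 9, 0]]
print(padded_labels)   # [[11, 12, -100, -100], [13, 14, 15, 16]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

Padding per batch rather than to a fixed global length keeps short batches small, which matters when training on limited hardware.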

In [10]:
import numpy as np
from datasets import load_metric

metric = load_metric("sacrebleu")
meteor = load_metric("meteor")

# Custom compute_metrics function that reports the BLEU score, mean prediction length, and METEOR score
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels with the pad token id, since -100 cannot be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    # Wrap each reference in a list: sacrebleu expects a list of references per prediction
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    meteor_result = meteor.compute(predictions=decoded_preds, references=decoded_labels)
    # Count the non-padding tokens in each generated sequence
    prediction_length = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]

    result = {"bleu": result["score"]}
    result["gen_len"] = np.mean(prediction_length)
    result["meteor"] = meteor_result["meteor"]
    result = {x: round(y, 4) for x, y in result.items()}
    return result
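Two array operations in the metric function are easy to misread: the replacement of -100 in the labels and the count of non-padding tokens behind "gen_len". The sketch below runs both on small made-up arrays. `PAD_ID = 0` is an assumption for illustration; in the notebook the value comes from `tokenizer.pad_token_id`.

```python
import numpy as np

PAD_ID = 0  # assumed pad token id, for illustration only

# Made-up label and prediction ids for a batch of two examples.
labels = np.array([[11, 12, -100, -100], [13, 14, 15, 16]])
preds = np.array([[21, 22, 23, 0], [24, 25, 0, 0]])

# Keep real label ids and swap every -100 for the pad id so decoding works.
clean_labels = np.where(labels != -100, labels, PAD_ID)

# Length of each prediction, ignoring padding.
prediction_lengths = [int(np.count_nonzero(pred != PAD_ID)) for pred in preds]
gen_len = float(np.mean(prediction_lengths))

print(clean_labels.tolist())  # [[11, 12, 0, 0], [13, 14, 15, 16]]
print(prediction_lengths)     # [3, 2]
print(gen_len)                # 2.5
```

The "Gen Len" value in the results is therefore the average number of generated tokens per translation, not a character or word count.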

[nltk_data] Downloading package wordnet to C:\Users\Sharanya
[nltk_data]     Manohar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Sharanya
[nltk_data]     Manohar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Sharanya
[nltk_data]     Manohar\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [11]:
# Build the trainer from the model, training arguments, datasets, collator, and metric function
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


In [12]:
# Fine-tune the model on the training subset
trainer.train()

The following columns in the training set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation. If translation are not expected by `MarianMTModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 63


| Epoch | Training Loss | Validation Loss | Bleu | Gen Len | Meteor |
| --- | --- | --- | --- | --- | --- |
| 1 | No log | 1.173238 | 36.3813 | 26.842 | 0.5012 |


The following columns in the evaluation set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation. If translation are not expected by `MarianMTModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=63, training_loss=2.1024366106305803, metrics={'train_runtime': 1991.2929, 'train_samples_per_second': 0.502, 'train_steps_per_second': 0.032, 'total_flos': 19455542820864.0, 'train_loss': 2.1024366106305803, 'epoch': 1.0})
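The throughput figures in the training output can be cross-checked against the runtime with a few lines of arithmetic. The values below are copied from the output above; only the variable names are introduced here.

```python
train_runtime = 1991.2929  # seconds, from 'train_runtime'
num_examples = 1000        # size of the training subset
global_step = 63           # from 'global_step'

samples_per_second = round(num_examples / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 3)
runtime_minutes = round(train_runtime / 60, 1)

print(samples_per_second)  # 0.502
print(steps_per_second)    # 0.032
print(runtime_minutes)     # 33.2
```

Both rates agree with the reported 'train_samples_per_second' and 'train_steps_per_second', so the single epoch over 1,000 examples took roughly 33 minutes.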