## Machine Translation Tasks

- Model used AutoModelForSeq2SeqLM instead of AutoModelForSequenceClassification
- Use DataCollatorForSeq2Seq instead of DataCollatoWithPadding
- For training args :  Seq2SeqTrainingArguments
- Trainer : Seq2SeqTrainer



- Dataset and tokenizer : Both inputs and targets needs to be tokenized
- Metrics : BLEU Score and BERT Score


Q -  Do we need two separate tokenizer in machine translation task?
-   In a machine translation task, you typically don't need two separate tokenizers; you can use a single tokenizer. However, it's important to clarify the roles of tokenization in machine translation.

Tokenization in machine translation refers to the process of breaking down a sequence of text into smaller units, typically words or subword units, to prepare the text for processing by a neural machine translation model. This tokenizer is used for both the source (input) language and the target (output) language. Here's how it works:

1. Source Language Tokenization: The text in the source language (the text you want to translate) is tokenized into smaller units, which are often words or subword units. The source text is then encoded into a sequence of tokens, and these tokens are fed into the machine translation model as input.

2. Target Language Tokenization: Similarly, the text in the target language (the translation output) is tokenized using the same tokenizer. The model generates translations in the form of tokens, which can then be decoded into words or phrases in the target language.

By using a single tokenizer for both the source and target languages, you ensure that the tokenization process is consistent across the entire translation pipeline. This consistency is essential because it aligns the source and target sequences, making it easier for the model to learn the relationships between words in the two languages.

It's worth noting that in some cases, you might use different tokenizers for source and target languages, especially when dealing with languages that have significantly different structures or character sets. However, it can complicate the translation process and require additional handling to align the tokenization of source and target text. In most cases, a single tokenizer for both languages simplifies the task and is the more common approach in machine translation.

So we should choose appropriate checkpoint for your dataset , checkpoint defined what tokenizers you will get

Context manager ensures that correct tokenizer used for target language.

### Task : English - French Translation

In [5]:
from datasets import load_dataset

In [6]:
from transformers.trainer_utils import get_last_checkpoint

In [7]:
data = load_dataset("kde4",lang1="en",lang2="fr")

In [8]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

In [9]:
data = data['train'].shuffle(seed=42).select(range(210173)) 
# data = data['train'].shuffle(seed=42).select(range(1000))

In [10]:
data = data.train_test_split(seed=42)

In [11]:
data['train'][0]

{'id': '189716', 'translation': {'en': 'DeskJet 340', 'fr': 'DeskJet 340'}}

In [12]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 157629
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 52544
    })
})

In [13]:
from transformers import AutoTokenizer

In [14]:
checkpoint = "Helsinki-NLP/opus-mt-en-fr"

In [15]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [16]:
en = data['train'][5]['translation']['en']
fr = data['train'][5]['translation']['fr']

In [17]:
en,fr

('%1 attribute of %2 element must either contain %3 or the other values.',
 "L'attribut %1 de l'élément %2 doit contenir soit %3 soit les autres valeurs.")

In [18]:
inputs = tokenizer(en)

In [19]:
inputs

{'input_ids': [301, 548, 31891, 7, 301, 331, 5709, 280, 1828, 5019, 301, 602, 57, 4, 126, 2619, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [20]:
# with tokenizer.as_target_tokenizer():
#     targets=tokenizer(fr)
# targets

In [21]:
targets = tokenizer(text_target=fr)

In [22]:
targets

{'input_ids': [87, 6, 36543, 301, 548, 5, 14, 6, 12039, 301, 331, 283, 13403, 345, 301, 602, 345, 16, 214, 2218, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [23]:
tokenizer.convert_ids_to_tokens(targets['input_ids'])

['▁L',
 "'",
 'attribut',
 '▁%',
 '1',
 '▁de',
 '▁l',
 "'",
 'élément',
 '▁%',
 '2',
 '▁doit',
 '▁contenir',
 '▁soit',
 '▁%',
 '3',
 '▁soit',
 '▁les',
 '▁autres',
 '▁valeurs',
 '.',
 '</s>']

In [24]:
# word language
# bad_targets = tokenizer(fr)
# tokenizer.convert_ids_to_tokens(bad_targets['input_ids'])

### Model Inputs

In [25]:
max_input_len = 128
max_target_len = 128

In [26]:
def tokenizer_func(batch):
    inputs = [x["en"] for x in batch["translation"]]
    targets = [x["fr"] for x in batch["translation"]]

    tokenized_inputs = tokenizer(
        inputs, max_length=max_input_len,truncation=True
    )

    tokenized_targets = tokenizer(targets,
        max_length=max_target_len, truncation=True
    )

    tokenized_inputs['labels'] = tokenized_targets['input_ids']
    return tokenized_inputs

In [27]:
tokenized_dataset = data.map(tokenizer_func,batched=True,remove_columns = data['train'].column_names)

Map: 100%|██████████| 157629/157629 [00:24<00:00, 6400.39 examples/s]
Map: 100%|██████████| 52544/52544 [00:08<00:00, 6362.41 examples/s]


In [28]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 157629
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 52544
    })
})

In [29]:
from transformers import AutoModelForSeq2SeqLM

In [30]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [31]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("The Device Opted :",device)
model.to(device)

The Device Opted : cuda:0


MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(59514, 512, padding_idx=59513)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(59514, 512, padding_idx=59513)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): SiLUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,),

In [32]:
from transformers import DataCollatorForSeq2Seq

In [33]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer,model=model)

In [34]:
## Data collator expects the inputs a list of sample points
[tokenized_dataset["train"][i] for i in range(1,3)]

[{'input_ids': [37483, 0], 'attention_mask': [1, 1], 'labels': [37483, 0]},
 {'input_ids': [18666, 4004, 26, 526, 46235, 51, 0],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1],
  'labels': [304, 34902, 794, 5, 6126, 8749, 27, 526, 46235, 51, 0]}]

In [35]:
batch = data_collator([tokenized_dataset['train'][i] for i in range(1,3)])

In [36]:
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [37]:
batch['labels']

tensor([[37483,     0,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100],
        [  304, 34902,   794,     5,  6126,  8749,    27,   526, 46235,    51,
             0]])

In [38]:
batch['decoder_input_ids']

tensor([[59513, 37483,     0, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
         59513],
        [59513,   304, 34902,   794,     5,  6126,  8749,    27,   526, 46235,
            51]])

In [39]:
# first token is a pad
tokenizer.convert_ids_to_tokens(batch['decoder_input_ids'][0])

['<pad>',
 '▁Mono',
 '</s>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>']

In [40]:
tokenizer.convert_ids_to_tokens(batch['labels'][0])

['▁Mono',
 '</s>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>',
 '<unk>']

### Metrics

In [41]:
from datasets import load_metric

In [42]:
bleu_metric = load_metric("sacrebleu")
bert_metric = load_metric("bertscore")

  bleu_metric = load_metric("sacrebleu")


In [43]:
import numpy as np

In [44]:
def compute_metrics(preds_and_labels):
    preds,labels  = preds_and_labels
    # convert preds to words
    decoded_preds = tokenizer.batch_decode(preds,skip_special_tokens=True)

    # for any label -100 replace it to pad token id
    labels= np.where(labels!=-100, labels, tokenizer.pad_token_id)

    # convert labels to word
    decoded_labels = tokenizer.batch_decode(labels,skip_special_tokens=True)

    # get rid of whitespaces and put targets to list
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    bleu = bleu_metric.compute(
        predictions = decoded_preds,references = decoded_labels
    )
    bert_score = bert_metric.compute(
         predictions = decoded_preds,references = decoded_labels, lang="fr"
    )

    return {
        "bleu" : bleu['score'], "bert_Score":np.mean(bert_score['f1'])
    }

In [45]:
from transformers import Seq2SeqTrainingArguments

In [46]:
out_dir_models="en-fr-finetuned-model"

In [47]:
training_args = Seq2SeqTrainingArguments(
    output_dir=out_dir_models,
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_eval_batch_size=64,
    per_device_train_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
)

In [48]:
from transformers import Seq2SeqTrainer

In [49]:
tokenized_dataset['train']

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 157629
})

In [50]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [51]:
trainer.train()

Step,Training Loss
500,2.1196
1000,1.4062
1500,1.2131
2000,1.1248
2500,1.0616
3000,0.974
3500,0.9518
4000,0.9268
4500,0.9044
5000,0.8871


TrainOutput(global_step=7389, training_loss=1.0557832871484956, metrics={'train_runtime': 1470.0682, 'train_samples_per_second': 321.677, 'train_steps_per_second': 5.026, 'total_flos': 1.1642637608681472e+16, 'train_loss': 1.0557832871484956, 'epoch': 3.0})

In [52]:
trainer.evaluate(max_length=max_target_len)

{'eval_loss': 0.8138642311096191,
 'eval_bleu': 49.96864492771791,
 'eval_bert_Score': 0.8893282916725825,
 'eval_runtime': 2777.2537,
 'eval_samples_per_second': 18.919,
 'eval_steps_per_second': 0.296,
 'epoch': 3.0}

In [53]:
trainer.save_model("en_fr_translation_model")

In [58]:
from transformers import pipeline

In [59]:
en_fr_translator = pipeline("translation",model="/home/ubuntu/uzair/NLP/machine_translation_tasks/en_fr_translation_model",device=0)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [60]:
en_fr_translator("The quick brown fox jumps over the lazy dog.")

[{'translation_text': 'Le fox rond rapide se passe au-dessus du morceau paresseux.'}]

In [57]:
GroundTruth =  "Le renard brun rapide saute par-dessus le chien paresseux."

