<a href="https://colab.research.google.com/github/ravadhani/NLP/blob/main/Transformers_R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -U datasets sacrebleu transformers[sentencepiece] --force-reinstall

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting sacrebleu
  Downloading sacrebleu-2.4.2-py3-none-any.whl (106 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.41.0-py3-none-any.whl (9.1 MB)
Collecting filelock (from datasets)
  Downloading filelock-3.14.0-py3-none-any.whl (12 kB)
Collecting numpy>=1.17 (from datasets)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)

In [1]:
from datasets import load_dataset, load_metric
from transformers import pipeline

#load dataset
raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/4.25k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/5.10k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.05M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/210173 [00:00<?, ? examples/s]

In [2]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

In [3]:
print(f"Size of labelled pair data: {raw_datasets['train'].num_rows}")

Size of labelled pair data: 210173


**Train and Test split**

In [4]:
#perform train-test split on the "train" split
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)

#rename the test "key" to "validation"
split_datasets["validation"] = split_datasets.pop("test")
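
As a quick sanity check on the split sizes, the arithmetic can be sketched in plain Python. This assumes the training split gets the rounded-down share of the rows and the held-out split gets the rest, which is consistent with the 189155 and 21018 row counts reported by the `Map` progress bars later in this notebook. `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def split_sizes(num_rows, train_size):
    # Assumption: the train split gets floor(train_size * num_rows) rows
    # and the held-out split (renamed "validation" here) gets the rest.
    n_train = math.floor(train_size * num_rows)
    return n_train, num_rows - n_train

n_train, n_validation = split_sizes(210173, 0.9)
print(n_train, n_validation)
```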


In [5]:
split_datasets["train"][1]

{'id': '152754',
 'translation': {'en': 'Default to expanded threads',
  'fr': 'Par défaut, développer les fils de discussion'}}

In [6]:
#take a look at a few elements of the split dataset
#the slice 10:18:2 extracts the "translation" values at indices 10, 12, 14 and 16
split_datasets["train"][10:18:2]["translation"]

[{'en': 'Text Cursor Movement', 'fr': 'Mouvements du curseur de texte'},
 {'en': '2004-09-15 3.10.00', 'fr': '2004-09-15 3.10.00'},
 {'en': 'Reload the namespaces from the server. This overwrites any changes.',
  'fr': 'Recharger les espaces de noms depuis le serveur. Cette action écrasera toutes les modifications effectuées.'},
 {'en': 'Credit Card Tracker', 'fr': 'Traqueur de carte de créditName'}]

In [7]:
#let us just take one value for now

split_datasets["validation"][10]["translation"]

{'en': 'Read from Valgrind process failed.',
 'fr': 'Impossible de lire depuis le processus Valgrind.'}

**Load the pre-trained model**

In [8]:
#model name
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"

#load model
translator = pipeline("translation", model = model_checkpoint)

config.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/301M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/778k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.34M [00:00<?, ?B/s]



Translation from the pretrained model

In [9]:
tmp_data = split_datasets["train"][172]["translation"]
tmp_translation = translator(tmp_data['en'])

print(f"Original English Text: `{tmp_data['en']}`")
print(tmp_translation)
print(f"Original French Text: `{tmp_data['fr']}`")

Original English Text: `Unable to import %1 using the OFX importer plugin. This file is not the correct format.`
[{'translation_text': "Impossible d'importer %1 en utilisant le plugin d'importateur OFX. Ce fichier n'est pas le bon format."}]
Original French Text: `Impossible d'importer %1 en utilisant le module d'extension d'importation OFX. Ce fichier n'a pas un format correct.`


Now do the same with the Transformers tokenizer and model classes directly, instead of the pipeline wrapper.

In [10]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from datasets import load_dataset
import tensorflow as tf

# Example tokenizer and model initialization
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)




tf_model.h5:   0%|          | 0.00/301M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [11]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)  #return_tensors is a call-time argument, e.g. tokenizer(text, return_tensors="tf")


In [12]:
print("Preprocessing one sample looks like this \n")

en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]

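#tokenize the English source with the default (source-language) tokenizer model
inputs = tokenizer(en_sentence)
#as_target_tokenizer() switches the tokenizer to the target (French) language
with tokenizer.as_target_tokenizer():
  targets = tokenizer(fr_sentence)

#without the context manager, the French text is split by the source-language tokenizer model
wrong_targets = tokenizer(fr_sentence)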
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(targets["input_ids"]))


Preprocessing one sample looks like this 

['▁Par', '▁dé', 'f', 'aut', ',', '▁dé', 've', 'lop', 'per', '▁les', '▁fil', 's', '▁de', '▁discussion', '</s>']
['▁Par', '▁défaut', ',', '▁développer', '▁les', '▁fils', '▁de', '▁discussion', '</s>']
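
Both token lists above spell out the same French sentence; the source-language tokenizer just needs more, smaller pieces to do it. A minimal sketch of how such pieces map back to text, assuming `▁` marks the start of a word and `</s>` is the end-of-sequence token. `pieces_to_text` is a hypothetical helper for illustration; real decoding should go through `tokenizer.decode`.

```python
def pieces_to_text(pieces, eos="</s>"):
    # Drop the end-of-sequence marker, glue the pieces together,
    # then turn each word-boundary marker into a space.
    kept = [p for p in pieces if p != eos]
    return "".join(kept).replace("▁", " ").strip()

wrong = ['▁Par', '▁dé', 'f', 'aut', ',', '▁dé', 've', 'lop', 'per',
         '▁les', '▁fil', 's', '▁de', '▁discussion', '</s>']
right = ['▁Par', '▁défaut', ',', '▁développer', '▁les', '▁fils',
         '▁de', '▁discussion', '</s>']

print(pieces_to_text(wrong))
print(pieces_to_text(right))
print(len(wrong), len(right))
```

Shorter target sequences built from whole-word pieces are what the pretrained decoder expects, which is why the labels should be tokenized in target mode.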




**Preprocessing**

Format the data to input to the transformer.

In [13]:
max_input_length = 128
max_target_length = 128

def preprocess_function(examples):
  inputs = [ex["en"] for ex in examples["translation"]]
  targets = [ex["fr"] for ex in examples["translation"]]
  model_inputs = tokenizer(inputs, max_length=max_input_length,
                           padding="max_length", truncation=True)

  #tokenize the targets in target-language mode;
  #text_target already switches the tokenizer, so as_target_tokenizer() is not needed
  labels = tokenizer(text_target=targets, max_length=max_target_length,
                     padding="max_length", truncation=True)

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs
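
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a plain-Python sketch of what happens to a single sequence. It assumes the pad id is 59513 (the value that fills the padded outputs in this notebook) and that truncation keeps the final end-of-sequence id. `pad_or_truncate` is a hypothetical helper, not a tokenizer API.

```python
PAD_ID = 59513  # assumption: the pad token id seen in this notebook's padded outputs

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    # Truncate: keep the first max_length - 1 ids plus the final end-of-sequence id.
    if len(ids) > max_length:
        ids = ids[:max_length - 1] + [ids[-1]]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * n_real + [0] * n_pad,
    }

encoded = pad_or_truncate([596, 1682, 0], max_length=8)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Padding every example to 128 at preprocessing time keeps shapes uniform but spends memory on short strings. An alternative worth considering is to skip padding here and let `DataCollatorForSeq2Seq` pad each batch, since its default label padding value is -100, which the loss ignores.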


In [14]:
split_datasets["train"].column_names

['id', 'translation']

In [15]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

Map:   0%|          | 0/189155 [00:00<?, ? examples/s]

Map:   0%|          | 0/21018 [00:00<?, ? examples/s]

**Model Initialization**

In [16]:
from transformers import DataCollatorForSeq2Seq
from transformers import TFAutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding=True, return_tensors="tf")

All PyTorch model weights were used when initializing TFMarianMTModel.

All the weights of TFMarianMTModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [17]:
#let's look at what the data collator returns for a small batch
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [18]:
#let's see the token ids the collator builds for the decoder input
batch["decoder_input_ids"]

<tf.Tensor: shape=(2, 128), dtype=int64, numpy=
array([[59513,   577,  5891,     2,  3184,    16,  2542,     5,  1710,
            0, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
        59513, 59513, 59513, 
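
The first row of `decoder_input_ids` above is that example's label sequence shifted one position to the right, with a start token in front. A minimal sketch of the idea, assuming the decoder start token is the pad id 59513 (as the first value of the tensor suggests) and that any -100 label padding is swapped for the pad id before shifting. `shift_right` is a hypothetical stand-in for what the model's `prepare_decoder_input_ids_from_labels` does when the collator calls it.

```python
PAD_ID = 59513       # assumption: pad id, also used as the decoder start token
IGNORE_INDEX = -100  # label value that the loss ignores

def shift_right(labels, start_id=PAD_ID, pad_id=PAD_ID):
    # Replace ignored label positions with the pad id, then prepend the
    # start token and drop the last position so the length is unchanged.
    cleaned = [pad_id if t == IGNORE_INDEX else t for t in labels]
    return [start_id] + cleaned[:-1]

labels = [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0, -100, -100]
print(shift_right(labels))
```

During training the decoder sees position `i` of the shifted sequence and is asked to predict position `i` of the labels, which is why the two differ by exactly one step.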

In [19]:
import tensorflow as tf

# Define the function to cast 'labels' to int64
def cast_labels(features):
    # Print the types before casting for debugging
    for key, value in features.items():
        print(f"Before casting - {key}: {value.dtype}")

    # Cast labels to int64
    features["labels"] = tf.cast(features["labels"], tf.int64)

    # Print the types after casting for debugging
    for key, value in features.items():
        print(f"After casting - {key}: {value.dtype}")

    return features

#convert train dataset to TensorFlow dataset
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns = ["input_ids", "attention_mask", "labels"],
    collate_fn = data_collator,
    shuffle = True,
    batch_size = 32,
    drop_remainder = True, #ensure consistent batch sizes
)

#apply the cast_labels function
tf_train_dataset = tf_train_dataset.map(cast_labels)

#convert the eval dataset to TensorFlow dataset
tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns = ["input_ids", "attention_mask", "labels"],
    collate_fn = data_collator,
    shuffle = False,
    batch_size = 16,
    drop_remainder = True, #ensure consistent batch sizes
)

#apply the cast_labels function
tf_eval_dataset = tf_eval_dataset.map(cast_labels)


TypeError: Cannot convert [array([2.4251e+04, 1.4000e+01, 6.0000e+00, 2.7740e+04, 1.8020e+03,
       7.4920e+03, 7.4000e+01, 1.3252e+04, 1.6820e+03, 1.2000e+01,
       8.5500e+03, 1.4550e+03, 1.5000e+01, 7.6000e+01, 1.1460e+03,
       1.7680e+03, 1.0886e+04, 1.6548e+04, 1.0741e+04, 0.0000e+00,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04, 5.9513e+04,
       5.9513e+04, 5.9513e+04, 5.9513e+04])] to EagerTensor of dtype int64

In [28]:
# Ensure columns exist
for column in ["input_ids", "attention_mask", "labels"]:
    if column not in tokenized_datasets["train"].features:
        raise ValueError(f"Column '{column}' is missing from the dataset")


In [27]:
print(tokenized_datasets["train"][0])


{'input_ids': [34378, 226, 5783, 32, 200, 12, 3647, 4, 1223, 1628, 117, 4923, 23608, 3, 1789, 2942, 20059, 301, 548, 301, 331, 30, 117, 4923, 12, 4, 1528, 668, 3, 5734, 212, 9319, 30, 4, 4923, 57, 5487, 30, 4, 6, 32712, 25, 7243, 1160, 12, 621, 42, 4, 1156, 3009, 3, 0, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0

In [21]:
print(tokenized_datasets["train"].features)
print(tokenized_datasets["validation"].features)


{'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
{'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}


In [22]:
print(tokenized_datasets["train"][:5])
print(tokenized_datasets["validation"][:5])


{'input_ids': [[34378, 226, 5783, 32, 200, 12, 3647, 4, 1223, 1628, 117, 4923, 23608, 3, 1789, 2942, 20059, 301, 548, 301, 331, 30, 117, 4923, 12, 4, 1528, 668, 3, 5734, 212, 9319, 30, 4, 4923, 57, 5487, 30, 4, 6, 32712, 25, 7243, 1160, 12, 621, 42, 4, 1156, 3009, 3, 0, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513], [47591, 12, 9842, 19634, 9, 0, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 5951

In [None]:
{'input_ids': [[34378, 226, 5783, 32, 200, 12, 3647, 4, 1223,
                1628, 117, 4923, 23608, 3, 1789, 2942, 20059, 301, 548, 301, 331, 30,
                117, 4923, 12, 4, 1528, 668, 3, 5734, 212, 9319, 30, 4, 4923, 57, 5487,
                30, 4, 6, 32712, 25, 7243, 1160, 12, 621, 42, 4, 1156, 3009, 3, 0],
                 [47591, 12, 9842, 19634, 9, 0], [1211, 3, 49, 9409, 1211, 3, 29140,
              817, 3124, 817, 28149, 139, 33712, 25218, 0], [596, 1682, 0], [135, 607, 2054,
           2, 3482, 10, 2843, 21048, 26, 67, 478, 0]],
  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[60, 7418, 5244, 8234, 740, 4993, 8, 6471, 5, 2218, 29, 193, 2220, 742, 3, 4366, 14237, 14, 6, 16600, 301, 548, 301, 331, 5, 193, 24275, 17, 8, 668, 6142, 3, 33640, 36, 81, 6, 5411, 2709, 9376, 22, 24275, 59, 36, 19, 9376, 153, 402, 29033, 13774, 402, 29033, 416, 27, 8, 4034, 4888, 3, 0], [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0], [1211, 3, 49, 9409, 1211, 3, 29140, 817, 3124, 817, 550, 7032, 5821, 7907, 12649, 0], [4194, 442, 0], [4322, 5, 30508, 2, 5, 4403, 11, 5, 6676, 27, 66, 478, 0]]}
{'input_ids': [[18466, 10, 741, 3118, 9016, 9, 0], [17921, 3317, 12812, 2559, 0], [160, 9049, 86, 1500, 15, 33602, 3089, 1374, 12, 4, 16494, 3, 6369, 9086, 746, 110, 12, 39296, 4, 14777, 7, 67, 9049, 3, 1963, 33, 61, 32, 18871, 26, 411, 4118, 7907, 9, 12, 256, 67, 1507, 26, 6372, 16494, 9, 246, 3, 0], [301, 548, 37, 304, 12815, 124, 12, 7445, 457, 1834, 1769, 3, 44172, 31994, 4586, 3878, 331, 0], [35, 1156, 3009, 18, 4, 48767, 32, 12, 5139, 993, 12, 6121, 4, 6137, 18, 15, 1437, 57, 3583, 61, 1980, 12, 15, 402, 4627, 50, 3, 213, 86, 79, 12, 11523, 4, 1437, 13632, 57, 2752, 1144, 12, 3583, 4, 6137, 1229, 12, 3, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[10773, 20, 6, 1549, 5, 14, 6, 8543, 11, 22, 644, 0], [42691, 108, 19, 2454, 738, 0], [335, 15973, 3435, 63, 34, 1574, 16829, 17, 14, 6, 29180, 3, 9538, 1648, 16036, 139, 110, 27, 33614, 14, 6, 18412, 5, 66, 15973, 3, 16468, 265, 68, 6, 107, 43, 4772, 27, 8, 1565, 13, 7907, 9, 3397, 20, 6, 4482, 497, 936, 27, 38892, 810, 16, 32239, 9, 3, 0], [301, 548, 402, 38492, 3800, 5, 17783, 437, 19, 14776, 30885, 3, 19440, 51, 13840, 4586, 3878, 331, 0], [277, 14, 6, 11627, 3204, 2, 14, 6, 35852, 88, 434, 5, 4094, 62, 6, 107, 283, 23, 10583, 10880, 19, 5789, 31, 34, 2428, 59, 14, 6, 17261, 1701, 17, 38, 8926, 3, 344, 19617, 34656,
  19, 689, 22, 2428, 59, 14, 6, 4620, 2755, 17, 915, 7978, 8, 8966, 5, 5789, 3, 0]]}

In [19]:
import numpy as np
def check_for_non_integer_labels(dataset, name):
    print(f"Checking {name} dataset for non-integer labels")
    for i, example in enumerate(dataset):
        labels = example["labels"]
        if any(type(label) not in (int, np.int32, np.int64) for label in labels):
            print(f"Non-integer label found in {name} dataset at index {i}: {labels}")
            return
    print(f"No non-integer labels found in {name} dataset")

# Check the train and validation datasets
check_for_non_integer_labels(tokenized_datasets["train"], "train")
check_for_non_integer_labels(tokenized_datasets["validation"], "validation")


Checking train dataset for non-integer labels
No non-integer labels found in train dataset
Checking validation dataset for non-integer labels
No non-integer labels found in validation dataset


In [21]:
def ensure_integer_labels(dataset, name):
    def convert_to_int(examples):
        examples["labels"] = [int(label) for label in examples["labels"]]
        return examples
    return dataset.map(convert_to_int)

def ensure_correct_types(dataset, name):
    def convert_types(examples):
        examples["input_ids"] = [int(x) for x in examples["input_ids"]]
        examples["attention_mask"] = [int(x) for x in examples["attention_mask"]]
        examples["labels"] = [int(x) for x in examples["labels"]]
        return examples
    return dataset.map(convert_types)

# Ensure all fields have correct types in the tokenized datasets
tokenized_datasets["train"] = ensure_correct_types(tokenized_datasets["train"], "train")
tokenized_datasets["validation"] = ensure_correct_types(tokenized_datasets["validation"], "validation")


Map:   0%|          | 0/189155 [00:00<?, ? examples/s]

Map:   0%|          | 0/21018 [00:00<?, ? examples/s]

In [22]:
# Function to print dataset types and values for debugging
def debug_dataset(dataset, name):
    print(f"Debugging {name} dataset")
    for i, example in enumerate(dataset):
        for key, value in example.items():
            print(f"{key}: {type(value[0])}, {value[:5]}")
        if i >= 4:  # Inspect the first 5 examples
            break

# Debug the train and validation datasets
debug_dataset(tokenized_datasets["train"], "train")
debug_dataset(tokenized_datasets["validation"], "validation")


Debugging train dataset
input_ids: <class 'int'>, [34378, 226, 5783, 32, 200]
attention_mask: <class 'int'>, [1, 1, 1, 1, 1]
labels: <class 'int'>, [60, 7418, 5244, 8234, 740]
input_ids: <class 'int'>, [47591, 12, 9842, 19634, 9]
attention_mask: <class 'int'>, [1, 1, 1, 1, 1]
labels: <class 'int'>, [577, 5891, 2, 3184, 16]
input_ids: <class 'int'>, [1211, 3, 49, 9409, 1211]
attention_mask: <class 'int'>, [1, 1, 1, 1, 1]
labels: <class 'int'>, [1211, 3, 49, 9409, 1211]
input_ids: <class 'int'>, [596, 1682, 0]
attention_mask: <class 'int'>, [1, 1, 1]
labels: <class 'int'>, [4194, 442, 0]
input_ids: <class 'int'>, [135, 607, 2054, 2, 3482]
attention_mask: <class 'int'>, [1, 1, 1, 1, 1]
labels: <class 'int'>, [4322, 5, 30508, 2, 5]
Debugging validation dataset
input_ids: <class 'int'>, [18466, 10, 741, 3118, 9016]
attention_mask: <class 'int'>, [1, 1, 1, 1, 1]
labels: <class 'int'>, [10773, 20, 6, 1549, 5]
input_ids: <class 'int'>, [17921, 3317, 12812, 2559, 0]
attention_mask: <class 'int'

In [30]:
import tensorflow as tf
from transformers import DataCollatorForSeq2Seq, AutoTokenizer

# Assuming tokenizer is already defined
# tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")

# Function to cast labels to int64 with detailed debugging
def cast_labels(features):
    # Inside tf.data.Dataset.map this function is traced in graph mode, so tensors
    # have no .numpy(); only dtype and shape are printed, once at trace time.
    for key, value in features.items():
        print(f"Before casting - {key}: dtype={value.dtype}, shape={value.shape}")
    try:
        features["labels"] = tf.cast(features["labels"], tf.int64)
    except Exception as e:
        print(f"Error casting labels: {e}")
        for key, value in features.items():
            print(f"Error with - {key}: dtype={value.dtype}, shape={value.shape}")
        raise
    for key, value in features.items():
        print(f"After casting - {key}: dtype={value.dtype}, shape={value.shape}")
    return features


# Convert the train dataset to TensorFlow dataset
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
    drop_remainder=True,
)


# Apply the cast_labels function with detailed debugging to the train dataset
tf_train_dataset = tf_train_dataset.map(cast_labels)

# Convert the eval dataset to TensorFlow dataset
tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
    drop_remainder=True,
)

# Apply the cast_labels function with detailed debugging to the eval dataset
tf_eval_dataset = tf_eval_dataset.map(cast_labels)


TypeError: Cannot convert [array([22279.,  5894.,     0., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.,
       59513., 59513., 59513., 59513., 59513., 59513., 59513., 59513.])] to EagerTensor of dtype int64

In [1]:
next(iter(tokenized_datasets))

NameError: name 'tokenized_datasets' is not defined

**Model Compilation**

- Define the optimizer
- Set the learning rate
- Set the number of epochs

In [None]:
from transformers import create_optimizer
import tensorflow as tf

#The number of training steps is the number of samples in the dataset, divided by the batch size, then
# multiplied by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.

num_epochs = 3
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
      init_lr=5e-4,
      num_warmup_steps=0,
      num_train_steps=num_train_steps,
      weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)
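
The step count and the resulting learning-rate curve can be sketched in plain Python. This assumes the 189155 training rows and batch size of 32 used earlier, and it assumes that `create_optimizer` with no warmup and default settings decays the learning rate linearly from `init_lr` to 0. `linear_decay_lr` is a hypothetical helper for illustration, not the library's schedule object.

```python
def linear_decay_lr(step, init_lr, num_train_steps):
    # Linear decay from init_lr at step 0 down to 0 at the last step
    # (assumption: no warmup and an end learning rate of 0).
    step = min(step, num_train_steps)
    return init_lr * (1 - step / num_train_steps)

num_samples, batch_size, num_epochs = 189155, 32, 3
steps_per_epoch = num_samples // batch_size  # drop_remainder=True discards the partial batch
num_train_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, num_train_steps)
print(linear_decay_lr(0, 5e-4, num_train_steps))
print(linear_decay_lr(num_train_steps // 2, 5e-4, num_train_steps))
```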

**Model Training**

In [20]:
model.fit(
    tf_train_dataset,
    validation_data = tf_eval_dataset,
    epochs = num_epochs,
)
model.save_pretrained('/content/drive/MyDrive/scalar-en-to-fr')

NameError: name 'tf_train_dataset' is not defined

**Model Inferencing**

In [None]:
model = TFAutoModelForSeq2SeqLM.from_pretrained('/content/drive/MyDrive/scalar-en-to-fr')

pipe = pipeline(task = 'translation',  #replace with whatever task is needed
                model = model,
                tokenizer = tokenizer)

In [None]:
#test
pipe('Unable to import %1 using the OFX importer plugin. This file is not the correct format.')

**Measuring the translation quality:**

**BLEU Score**

- BLEU is a metric that quantifies the quality of a machine translation (MT) system.
- It stands for BiLingual Evaluation Understudy.
- It scores a machine-generated translation by its n-gram overlap with one or more human reference translations, so references written by different annotators can all be used in the comparison.
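
To show what the score measures, here is a simplified sentence-level sketch: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization and no smoothing, so its numbers will not match `sacrebleu`, which applies its own tokenization and aggregates over the whole corpus. `simple_bleu` is a hypothetical name used only here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count every contiguous run of n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, references, max_n=4):
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by its highest count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty against the reference closest in length to the hypothesis.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

hypothesis = ("Impossible d'importer %1 en utilisant le plugin d'importateur OFX. "
              "Ce fichier n'est pas le bon format.")
reference = ("Impossible d'importer %1 en utilisant le module d'extension "
             "d'importation OFX. Ce fichier n'a pas un format correct.")
print(simple_bleu(hypothesis, [reference]))
```

Passing several strings in `references` lets a hypothesis earn credit for matching any of them, which is how the metric copes with annotators who translate the same sentence differently.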

In [21]:
from datasets import load_metric

metric = load_metric("sacrebleu")


  metric = load_metric("sacrebleu")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

In [None]:
import numpy as np

def compute_metrics(model):
  all_preds = []
  all_labels = []

  #sample 200 validation examples and convert them to a tf.data.Dataset
  sampled_dataset = tokenized_datasets["validation"].shuffle().select(range(200))
  tf_generate_dataset = sampled_dataset.to_tf_dataset(
      columns=["input_ids", "attention_mask", "labels"],
      collate_fn=data_collator,
      shuffle=False,
      batch_size=4,
  )

  #generate translations batch by batch
  for batch in tf_generate_dataset:

    # predictions
    predictions = model.generate(
        input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
    )

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = batch["labels"].numpy()

    # replace the ignore index -100 with the pad token id so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # ids to text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds = [pred.strip() for pred in decoded_preds]
    # sacrebleu expects a list of references for each prediction
    decoded_labels = [[label.strip()] for label in decoded_labels]
    all_preds.extend(decoded_preds)
    all_labels.extend(decoded_labels)

  result = metric.compute(predictions=all_preds, references=all_labels)
  return {"bleu": result["score"]}
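
The two reshaping steps inside the loop can be checked without a model. The sketch below mirrors them in plain Python, assuming a pad id of 59513 and using hypothetical helper names: ignored label positions are turned back into pad ids so they can be decoded, and each reference string is wrapped in its own list because the metric accepts several references per prediction.

```python
PAD_ID = 59513  # assumption: the pad token id used elsewhere in this notebook

def restore_padding(label_rows, pad_id=PAD_ID, ignore_index=-100):
    # Plain-Python equivalent of np.where(labels != -100, labels, pad_id).
    return [[pad_id if t == ignore_index else t for t in row] for row in label_rows]

def to_metric_format(decoded_preds, decoded_labels):
    # Predictions: flat list of strings. References: one list of strings per prediction.
    preds = [p.strip() for p in decoded_preds]
    refs = [[label.strip()] for label in decoded_labels]
    return preds, refs

print(restore_padding([[4194, 442, 0, -100, -100]]))
print(to_metric_format([" Bonjour "], ["Bonjour "]))
```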

In [None]:
print(compute_metrics(model))