<a href="https://colab.research.google.com/github/qmeng222/transformers-for-NLP/blob/main/Seq2Seq/Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# install libraries:
!pip install transformers datasets sentencepiece
# `transformers` library: for using pre-trained models
# `datasets` library: to access a collection of high-quality datasets for NLP tasks
# `sentencepiece` library: to tokenize text into subwords


# Load the dataset & split it into training and testing sets:

Loading a dataset downloads it on first use if it is not already cached on your system; later calls reuse the cached copy.

In [2]:
from datasets import load_dataset # from the library, import the function

# [Reference] possible language pairs: https://opus.nlpl.eu/KDE4.php
data = load_dataset("kde4", lang1="en", lang2="fr") # load a dataset named "kde4" with specific language configurations
data

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

👆 Too many samples for a quick demo, so we'll work with a small subset!

In [3]:
small = data["train"].shuffle(seed=42).select(range(1_000)) # create a smaller, shuffled subset of the training data
# .shuffle(): this method shuffles the training examples
# seed=42: the seed parameter is set to 42 to ensure reproducibility
# .select(range(1_000)): selects the first 1,000 examples of the shuffled data (indices 0-999)
small

Dataset({
    features: ['id', 'translation'],
    num_rows: 1000
})
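
To see why a fixed seed makes the subset reproducible, here is a minimal pure-Python sketch of the shuffle-then-select idea. It is only an illustration: `shuffled_subset` is a hypothetical helper, and `datasets` uses its own random generator, so the rows it actually picks will differ from these.

```python
import random

def shuffled_subset(items, k, seed):
    """Return the first k items of a seeded shuffle (hypothetical helper)."""
    rng = random.Random(seed)          # private generator, global state untouched
    indices = list(range(len(items)))
    rng.shuffle(indices)               # permute the indices in place
    return [items[i] for i in indices[:k]]

examples = list(range(210_173))        # stand-in for the 210,173 training rows
a = shuffled_subset(examples, 1_000, seed=42)
b = shuffled_subset(examples, 1_000, seed=42)
print(len(a), a == b)                  # same seed -> same subset
```

Changing the seed gives a different subset, which is why the seed is worth recording alongside any results.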

In [6]:
# split dataset ('small') into training and testing sets (by default, 25% of the rows go to the test set):
split = small.train_test_split(seed=42) # fixed seed so the split is reproducible
split

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 750
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 250
    })
})
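
The 750/250 sizes above follow from the default test fraction of 0.25. A small sketch of the arithmetic, assuming the test set size is rounded up and the training set gets the remainder (`split_sizes` is a hypothetical helper, not part of `datasets`):

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Sizes of the train/test parts for a given test fraction (hypothetical helper)."""
    n_test = math.ceil(test_size * n_samples)  # assumption: test set is rounded up
    n_train = n_samples - n_test               # training set gets the rest
    return n_train, n_test

print(split_sizes(1_000))  # (750, 250)
```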

In [8]:
# check the 1st example from the training set:
split["train"][0]

{'id': '169005',
 'translation': {'en': '& Reduce Tree', 'fr': "& Refermer l' arborescence"}}

# Tokenize the training data:

In [9]:
from transformers import AutoTokenizer # import the class, enabling dynamic loading of tokenizer for a specific pre-trained model

checkpoint = "Helsinki-NLP/opus-mt-en-fr" # model identifier (specify the name of a pre-trained model)
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # automatically load the appropriate tokenizer

In [10]:
# check the 6th example from the training set:
split["train"][5]

{'id': '46472',
 'translation': {'en': 'You can either pick a file or enter its name in the Location: box.',
  'fr': 'Vous pouvez soit choisir un fichier soit saisir son nom dans la zone de texte Emplacement.'}}

In [16]:
# extract the English and French translation texts from the 6th example (idx=5):
en = split['train'][5]['translation']['en']
fr = split['train'][5]['translation']['fr']
en, fr

('You can either pick a file or enter its name in the Location: box.',
 'Vous pouvez soit choisir un fichier soit saisir son nom dans la zone de texte Emplacement.')

In [18]:
# tokenize the English source text -> the input tokens fed to the model
inputs = tokenizer(en)
inputs

{'input_ids': [213, 115, 1828, 8437, 15, 1437, 57, 3307, 96, 1129, 18, 4, 4577, 37, 5311, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

👆 Note: the output also comes with an attention mask (all 1s here, because a single sentence needs no padding)
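
The attention mask only becomes interesting once sentences of different lengths are batched together. The sketch below pads two sequences by hand to show the idea; `pad_batch` and `PAD_ID` are hypothetical (a real tokenizer defines its own pad id and does the padding for you), and `short_ids` is a made-up shorter sequence.

```python
PAD_ID = -1  # hypothetical placeholder; a real tokenizer defines its own pad id

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad sequences to a common length and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# the English ids printed above, plus a made-up shorter sequence
en_ids = [213, 115, 1828, 8437, 15, 1437, 57, 3307, 96, 1129, 18, 4, 4577, 37, 5311, 3, 0]
short_ids = [213, 115, 3, 0]

batch = pad_batch([en_ids, short_ids])
print(batch["attention_mask"][1])  # 1s for the four real tokens, 0s for the padding
```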

In [21]:
# tokenize the French translation text -> the target tokens
targets = tokenizer(text_target=fr) # NOTE: must specify `text_target` here!
targets

{'input_ids': [344, 1069, 345, 4094, 34, 2428, 345, 9315, 113, 689, 31, 8, 1283, 5, 1470, 21708, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

👆 `targets` is a dictionary obtained from tokenizing some text, and `'input_ids'` is a key in this dictionary. The value associated with this key is **a sequence of token IDs**.

In [20]:
# as a sanity check, convert the ids back into string tokens:
# the `convert_ids_to_tokens` method is a functionality provided by `tokenizer`
# it takes a sequence of token ids & returns the corresponding tokens in a human-readable format
tokenizer.convert_ids_to_tokens(targets['input_ids'])

['▁Vous',
 '▁pouvez',
 '▁soit',
 '▁choisir',
 '▁un',
 '▁fichier',
 '▁soit',
 '▁saisir',
 '▁son',
 '▁nom',
 '▁dans',
 '▁la',
 '▁zone',
 '▁de',
 '▁texte',
 '▁Emplacement',
 '.',
 '</s>']

Matches the French sentence printed in cell 10 🎉

Here, one word -> one string token ✅
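
The `▁` prefix marks the start of a word, so the pieces can be glued back into the sentence by hand. A minimal sketch of that convention follows; `detokenize` is a hypothetical helper written for this note, and in practice you would call the tokenizer's own decoding method instead.

```python
tokens = ['▁Vous', '▁pouvez', '▁soit', '▁choisir', '▁un', '▁fichier', '▁soit',
          '▁saisir', '▁son', '▁nom', '▁dans', '▁la', '▁zone', '▁de', '▁texte',
          '▁Emplacement', '.', '</s>']

def detokenize(pieces, eos="</s>"):
    """Join subword pieces and turn the word-start marker back into spaces."""
    text = "".join(p for p in pieces if p != eos)  # drop the end-of-sequence token
    return text.replace("▁", " ").strip()          # '▁' marks where a word begins

print(detokenize(tokens))
```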

In [22]:
# (wrong) demo: what happens if we don't specify `text_target` as we did in cell 21?
bad_targets = tokenizer(fr) # should be `targets = tokenizer(text_target=fr)` instead
tokenizer.convert_ids_to_tokens(bad_targets['input_ids'])

['▁V',
 'ous',
 '▁po',
 'uv',
 'ez',
 '▁so',
 'it',
 '▁cho',
 'is',
 'ir',
 '▁un',
 '▁fi',
 'chi',
 'er',
 '▁so',
 'it',
 '▁s',
 'ais',
 'ir',
 '▁son',
 '▁no',
 'm',
 '▁dans',
 '▁la',
 '▁zone',
 '▁de',
 '▁text',
 'e',
 '▁Em',
 'placement',
 '.',
 '</s>']

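👆 This doesn't fail outright.

However, without `text_target` the French text is split with the source-language (English) vocabulary, so one word -> multiple string tokens ❌ (32 tokens instead of 18)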
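
The effect can be reproduced with a toy example. The sketch below uses a greedy longest-match rule and two tiny made-up vocabularies; this is a deliberate simplification (the real SentencePiece models use a learned segmentation, not this rule), and `greedy_tokenize` and both vocabularies are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation with single-character fallback."""
    text = "▁" + text.replace(" ", "▁")  # mark word starts with '▁'
    pieces, i = [], 0
    while i < len(text):
        # try the longest candidate first; fall back to one character
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                pieces.append(text[i:j])
                i = j
                break
    return pieces

toy_target_vocab = {"▁Vous", "▁pouvez"}               # made-up "French" vocabulary
toy_source_vocab = {"▁V", "ous", "▁po", "uv", "ez"}   # made-up "English" vocabulary

print(greedy_tokenize("Vous pouvez", toy_target_vocab))  # ['▁Vous', '▁pouvez']
print(greedy_tokenize("Vous pouvez", toy_source_vocab))  # ['▁V', 'ous', '▁po', 'uv', 'ez']
```

A vocabulary that has never seen whole French words can still cover the text, but only by stitching together short fragments, which gives the model longer and less meaningful target sequences.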