# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 3</a>
## Transformers - English to German Translation Example

In this exercise we experiment with Transformers. For this notebook, we install a library called Trax, a deep learning library from the Google Brain team that includes an implementation of the Transformer model from the "Attention Is All You Need" paper.

In [1]:
# Upgrade dependencies
!pip install -r ../../requirements.txt
!pip install gsutil

Collecting typing-extensions
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.10.0.0
    Uninstalling typing-extensions-3.10.0.0:
      Successfully uninstalled typing-extensions-3.10.0.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 1.0.61 requires nvidia-ml-py3, which is not installed.
spacy 3.0.6 requires pydantic<1.8.0,>=1.7.1, but you have pydantic 1.8.2 which is incompatible.
aiobotocore 1.3.0 requires botocore<1.20.50,>=1.20.49, but you have botocore 1.22.3 which is incompatible.
Successfully installed typing-extensions-3.7.4.3
Collecting gsutil
  Using cached gsutil-5.5.tar.gz (2.9 MB)
  Preparing metadata (setup.py) ... done
Collecting argcomplete>=1.9.4
  Using ca

In [2]:
# import libraries
import os
import numpy as np
import trax

Let's use a pre-trained Transformer model. We initialize the model with weights trained for English-German translation. You can check out the [GitHub repo](https://github.com/google/trax) for more details.

In [3]:
# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
print("Defining model...")
model = trax.models.Transformer(
    input_vocab_size=33300,
    d_model=512,
    d_ff=2048,
    n_heads=8,
    n_encoder_layers=6,
    n_decoder_layers=6,
    max_len=2048,
    mode="predict",
)

# Initialize using pre-trained weights.
print("Initialize weights using pre-trained values...")
model.init_from_file(
    "gs://trax-ml/models/translation/ende_wmt32k.pkl.gz", weights_only=True
)

Defining model...
Initialize weights using pre-trained values...




In [4]:
# Tokenize a sentence.
print("Tokenize sentence...")
sentence = "The rabbit couldn’t cross the street because it was too tired"
print(f"Sentence: {sentence}\n")
tokenized = list(
    trax.data.tokenize(
        iter([sentence]),  # Operates on streams.
        vocab_dir="gs://trax-ml/vocabs/",
        vocab_file="ende_32k.subword",
    )
)[0]
print(f"Tokenized: {tokenized}")

Tokenize sentence...
Sentence: The rabbit couldn’t cross the street because it was too tired

Tokenized: [   29 13347  2579 20530    59    62  3410     4  3792   241    40    53
   361 19179    86]


Let's take a look at the (subword) vocabulary that the model uses.

In [5]:
# Download the file with the vocab for inspection
!gsutil cp gs://trax-ml/vocabs/ende_32k.subword .

Copying gs://trax-ml/vocabs/ende_32k.subword...
/ [1 files][313.8 KiB/313.8 KiB]                                                
Operation completed over 1 objects/313.8 KiB.                                    


In [6]:
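# Load the vocab into a dict mapping token id -> subword string
vocab = {}
with open("ende_32k.subword", "r", encoding="utf-8") as fi:
    for i, line in enumerate(fi):
        # Each line holds one quoted subword; drop the quotes
        vocab[i] = line.strip().replace("'", "")
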
print(f"\nLen of vocab: {len(vocab)}")


Len of vocab: 33288


In [7]:
print(f"Sentence: {sentence}\n")
print(f"Tokenized: {tokenized}\n")
print(f"Subword tokens: {[vocab[tk_id] for tk_id in tokenized]}")

Sentence: The rabbit couldn’t cross the street because it was too tired

Tokenized: [   29 13347  2579 20530    59    62  3410     4  3792   241    40    53
   361 19179    86]

Subword tokens: ['The_', 'rab', 'bit_', 'couldn_', '’_', 't_', 'cross_', 'the_', 'street_', 'because_', 'it_', 'was_', 'too_', 'tire', 'd_']
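
In the output above, pieces that end a word carry a trailing `_`, while pieces such as `rab` and `tire` do not. The following is a minimal sketch of how such pieces could be grouped back into words, assuming the trailing underscore is the only end-of-word signal. The helper `group_subwords` is hypothetical and not part of Trax; the real `trax.data.detokenize` also restores spacing around punctuation, which this sketch does not attempt.

```python
def group_subwords(pieces):
    """Group subword pieces into words, treating a trailing "_" as end-of-word.

    Assumption: the underscore convention is inferred from the notebook output,
    not taken from the tokenizer's documentation.
    """
    words, current = [], ""
    for piece in pieces:
        if piece.endswith("_"):
            # End of a word: attach the piece without its marker and flush.
            words.append(current + piece[:-1])
            current = ""
        else:
            # Word continues: keep accumulating.
            current += piece
    if current:  # keep any dangling pieces that never saw an end marker
        words.append(current)
    return words


# Subword pieces copied from the notebook output above.
pieces = ['The_', 'rab', 'bit_', 'couldn_', '’_', 't_', 'cross_', 'the_',
          'street_', 'because_', 'it_', 'was_', 'too_', 'tire', 'd_']
words = group_subwords(pieces)
print(words)
```

Note that `rab` + `bit_` and `tire` + `d_` merge into single words, while the apostrophe stays a separate item because it was tokenized as its own piece.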


In [8]:
# Decode from the Transformer.
print("Decoding...")
tokenized = tokenized[None, :]  # Add batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0
)  # Higher temperature: more diverse results.
print(tokenized_translation)

# De-tokenize.
print("Detokenizing...")
tokenized_translation = tokenized_translation[0][:-1]  # Remove batch and EOS.
translation = trax.data.detokenize(
    tokenized_translation,
    vocab_dir="gs://trax-ml/vocabs/",
    vocab_file="ende_32k.subword",
)
print(f"Final translation: {translation}")

Decoding...
[[  149  6660 11125  8869  4856  1770    10  4328    44  4374  7229    28
      2   424    33    18 28254    35   142     3     1]]
Detokenizing...
Final translation: Der Kaninchenblock konnte die Straße nicht überqueren, weil es zu müde war.
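
The `temperature` argument controls how much randomness goes into choosing each output token. The sketch below illustrates the general idea with a generic softmax-with-temperature sampler in plain Python. It is not Trax's implementation, the function name `sample_with_temperature` and the example scores are made up for illustration, and treating a temperature of 0 as a plain argmax is an assumption that matches the deterministic output seen above.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=random):
    """Pick an index from `logits`; temperature 0 falls back to argmax."""
    if temperature == 0.0:
        # No randomness: always take the highest-scoring candidate.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Dividing by the temperature flattens (T > 1) or sharpens (T < 1) the scores.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
greedy = sample_with_temperature(logits, temperature=0.0)

rng = random.Random(0)  # fixed seed so repeated runs behave the same
diverse = [sample_with_temperature(logits, 1.5, rng) for _ in range(1000)]
print(greedy, sorted(set(diverse)))
```

With temperature 0 the same candidate is chosen every time, which is why rerunning the decoding cell yields the same translation; a higher temperature lets lower-scoring candidates appear as well.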


In [9]:
print(f"Tokenized translation: {tokenized_translation}\n")
print(f"Subword tokens: {[vocab[tk_id] for tk_id in tokenized_translation]}")

Tokenized translation: [  149  6660 11125  8869  4856  1770    10  4328    44  4374  7229    28
     2   424    33    18 28254    35   142     3]

Subword tokens: ['Der_', 'Kan', 'inc', 'hen', 'block_', 'konnte_', 'die_', 'Straße_', 'nicht_', 'über', 'quer', 'en_', ', _', 'weil_', 'es_', 'zu_', 'müd', 'e_', 'war_', '._']
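
The decoded ids above were produced one token at a time, and the raw output ended with id `1`, which the notebook strips as the end-of-sequence (EOS) marker. As a rough sketch of that autoregressive loop, the code below feeds the tokens generated so far back into a next-token function until EOS appears. The names `greedy_decode` and `toy_next_token` are hypothetical, and the "model" is only a stand-in that replays a fixed list of ids; it does not call Trax.

```python
EOS_ID = 1  # the raw decoded sequence in the notebook ends with id 1


def greedy_decode(next_token_fn, max_len=20):
    """Repeatedly ask `next_token_fn` for the next id until EOS or max_len."""
    output = []
    while len(output) < max_len:
        token = next_token_fn(output)  # condition on everything generated so far
        output.append(token)
        if token == EOS_ID:
            break
    return output


# Hypothetical stand-in for a model: replays a fixed sequence of ids.
script = [149, 6660, 11125, EOS_ID]


def toy_next_token(prefix):
    return script[len(prefix)]


decoded = greedy_decode(toy_next_token)
print(decoded)       # full sequence, including the EOS id
print(decoded[:-1])  # EOS removed, as done before detokenizing
```

A real model would replace `toy_next_token` with a forward pass that scores every vocabulary entry given the source sentence and the prefix generated so far.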
