##### Copyright 2019 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Subword tokenizers

This notebook demonstrates how to generate a subword vocabulary from a dataset and use that vocabulary to build a `text.BertTokenizer`.

The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.
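
The fallback behavior can be illustrated with a few lines of plain Python. This is a minimal sketch of greedy longest-match-first splitting, not the `tensorflow_text` implementation: the helper `wordpiece_split` and the toy vocabulary are hypothetical, and the `##` prefix marks a piece that continues a word.

```python
def wordpiece_split(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"mais", "pal", "##es", "##tra"}

common = wordpiece_split("mais", toy_vocab)      # a word with its own slot
rare = wordpiece_split("palestra", toy_vocab)    # falls back to word pieces
unknown = wordpiece_split("xyz", toy_vocab)      # nothing matches
print(common, rare, unknown)
```

A real vocabulary normally also contains the individual characters, so the `[UNK]` case is rare in practice.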

Objective: At the end of this notebook you'll have built a complete end-to-end wordpiece tokenizer and detokenizer from scratch, and saved it as a `saved_model` that you can load and use later for translator training.

## Overview

The `tensorflow_text` package includes TensorFlow implementations of many common tokenizers. In this notebook we will use `text.BertTokenizer`. The `BertTokenizer` class is a higher level interface. It includes BERT's token splitting algorithm and a `WordPieceTokenizer`. It takes **sentences** as input and returns **token-IDs**.
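
To give a feel for the token-splitting step that runs before the wordpiece lookup, here is a rough plain-Python approximation. It assumes lower-casing, NFD normalization with accent stripping, and a split on whitespace and punctuation; `basic_split` is a hypothetical helper, and the real `BertTokenizer` handles more cases than this sketch does.

```python
import re
import unicodedata


def basic_split(sentence):
    """Approximate BERT-style splitting: lower-case, strip accents, split."""
    # Lower-case and decompose accented characters (NFD).
    text = unicodedata.normalize("NFD", sentence.lower())
    # Drop the combining marks ("Mn") left behind by the decomposition.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Words are runs of word characters; each punctuation mark stands alone.
    return re.findall(r"\w+|[^\w\s]", text)


words = basic_split("Le probl\u00e8me d'autres.")
print(words)
```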

This notebook builds a Wordpiece vocabulary in a top-down manner, starting from existing words.

## Setup

In [None]:
!pip install -q -U tensorflow-text


In [None]:
!pip install -q tensorflow_datasets

In [None]:
import collections
import os
import pathlib
import re
import string
import sys
import tempfile
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow_text as text
import tensorflow as tf

In [None]:
tf.get_logger().setLevel('ERROR')
pwd = pathlib.Path.cwd()

## Download the dataset

Fetch the French/Portuguese translation dataset from [tfds](https://tensorflow.org/datasets):

In [None]:
examples, metadata = tfds.load('ted_hrlr_translate/fr_to_pt', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']  

Downloading and preparing dataset ted_hrlr_translate/fr_to_pt/1.0.0 (download: 124.94 MiB, generated: Unknown size, total: 124.94 MiB) to /root/tensorflow_datasets/ted_hrlr_translate/fr_to_pt/1.0.0...

Dataset ted_hrlr_translate downloaded and prepared to /root/tensorflow_datasets/ted_hrlr_translate/fr_to_pt/1.0.0. Subsequent calls will reuse this data.


This dataset produces French/Portuguese sentence pairs:

In [None]:
for fr, pt in train_examples.take(1):
  print("French: ", fr.numpy().decode('utf-8'))
  print("Portuguese:   ", pt.numpy().decode('utf-8'))

French:  mais cela trahit aussi la panique , la terreur , que la grossophobie peut évoquer .
Portuguese:    mas também faz notar o pânico , o terror literal , que o medo da gordura evoca .


Note a few things about the example sentences above:
* They're lower case.
* There are spaces around the punctuation.
* It's not clear whether any Unicode normalization has been applied, or which form.
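
The normalization question can be checked with the standard library. The sketch below uses a hand-written example word rather than the dataset, and `normalization_forms` is a hypothetical helper that reports which normalization forms a string already satisfies.

```python
import unicodedata

composed = "\u00e9voquer"      # 'é' stored as a single code point
decomposed = "e\u0301voquer"   # 'e' followed by a combining acute accent


def normalization_forms(s):
    """Return the normalization forms that leave `s` unchanged."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return [form for form in forms if unicodedata.normalize(form, s) == s]


# The two strings render identically but are different byte sequences.
print(composed == decomposed)
print(normalization_forms(composed))
print(normalization_forms(decomposed))
```

Running the same helper over decoded dataset sentences would show which form, if any, the data consistently follows.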

In [None]:
train_fr = train_examples.map(lambda fr, pt: fr)
train_pt = train_examples.map(lambda fr, pt: pt)

## Generate the vocabulary

This section generates a wordpiece vocabulary from a dataset.

The vocabulary generation code is included in the `tensorflow_text` pip package. It is not imported by default, so you need to import it manually:

In [None]:
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

In [None]:
bert_tokenizer_params=dict(lower_case=True)
reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"]

bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size = 8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)

In [None]:
%%time
pt_vocab = bert_vocab.bert_vocab_from_dataset(
    train_pt.batch(1000).prefetch(2),
    **bert_vocab_args
)

CPU times: user 2min 7s, sys: 1.72 s, total: 2min 9s
Wall time: 2min 7s


Here are some slices of the resulting vocabulary.

In [None]:
print(pt_vocab[:10])
print(pt_vocab[100:110])
print(pt_vocab[1000:1010])
print(pt_vocab[-10:])

['[PAD]', '[UNK]', '[START]', '[END]', '!', '#', '$', '%', '&', "'"]
['no', 'por', 'na', 'mais', 'eu', 'esta', 'muito', 'isso', 'isto', 'sao']
['palestra', 'podera', 'velocidade', '##rem', '##tivo', 'alta', 'aumentar', 'coracao', 'fazia', 'modelos']
['##–', '##—', '##‘', '##’', '##“', '##”', '##⁄', '##€', '##♪', '##♫']


Write a vocabulary file:

In [None]:
def write_vocab_file(filepath, vocab):
  with open(filepath, 'w', encoding='utf-8') as f:
    for token in vocab:
      print(token, file=f)

In [None]:
write_vocab_file('pt_vocab.txt', pt_vocab)
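
The file holds one token per line, so a token's line index serves as its ID. As a sanity check, a vocabulary can be written and read back. This is a minimal sketch that uses a hypothetical toy vocabulary and a temporary directory instead of the real `pt_vocab.txt`:

```python
import pathlib
import tempfile

# Hypothetical toy vocabulary; the last entry contains a non-ASCII character.
toy_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "mais", "##tra", "##\u20ac"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "toy_vocab.txt"
    # Write one token per line, with an explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        for token in toy_vocab:
            print(token, file=f)
    # Read the file back with the same encoding.
    restored = path.read_text(encoding="utf-8").splitlines()

# The line index of each token is its ID.
token_to_id = {token: i for i, token in enumerate(restored)}
print(token_to_id["[START]"], token_to_id["[END]"])
```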

Use the same procedure to generate a vocabulary from the French data:

In [None]:
%%time
fr_vocab = bert_vocab.bert_vocab_from_dataset(
    train_fr.batch(1000).prefetch(2),
    **bert_vocab_args
)


CPU times: user 2min 16s, sys: 1.45 s, total: 2min 18s
Wall time: 2min 22s


In [None]:
print(fr_vocab[:10])
print(fr_vocab[100:110])
print(fr_vocab[1000:1010])
print(fr_vocab[-10:])

['[PAD]', '[UNK]', '[START]', '[END]', '!', '#', '$', '%', '&', "'"]
['le', 'que', 'un', 'nous', 'des', 'en', 'une', 'vous', 'il', 'ce']
['piece', 'revenir', 'succes', 'uns', '##ction', 'abeilles', 'couleur', 'heureux', 'impact', 'ordinateurs']
['##–', '##—', '##‘', '##’', '##“', '##”', '##⁄', '##€', '##≈', '##♪']


Write the French vocabulary file, then list both files:

In [None]:
write_vocab_file('fr_vocab.txt', fr_vocab)

In [None]:
!ls *.txt

fr_vocab.txt  pt_vocab.txt


## Build the tokenizer
<a id="build_the_tokenizer"></a>

The `text.BertTokenizer` can be initialized by passing the vocabulary file's path as the first argument: 

In [None]:
pt_tokenizer = text.BertTokenizer('pt_vocab.txt', **bert_tokenizer_params)
fr_tokenizer = text.BertTokenizer('fr_vocab.txt', **bert_tokenizer_params)

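Now you can use it to encode some text. Take a batch of 3 examples from the training data. Note that the dataset yields `(French, Portuguese)` pairs, so the first loop variable below, although named `pt_examples`, holds the French sentences, as the output shows: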

In [None]:
for pt_examples, fr_examples in train_examples.batch(3).take(1):
  for ex in pt_examples:
    print(ex.numpy())

b'mais cela trahit aussi la panique , la terreur , que la grossophobie peut \xc3\xa9voquer .'
b"mais le vrai probl\xc3\xa8me est le manque d'autres infrastructures ."
b'quand ils sont endommag\xc3\xa9s , par exemple par la fum\xc3\xa9e de cigarette , ils ne fonctionnent pas correctement , et ne peuvent pas expulser le mucus .'


Run it through the `BertTokenizer.tokenize` method. Initially, this returns a `tf.RaggedTensor` with axes `(batch, word, word-piece)`:

In [None]:
# Tokenize the examples -> (batch, word, word-piece)
token_batch = pt_tokenizer.tokenize(pt_examples)
# Merge the word and word-piece axes -> (batch, tokens)
token_batch = token_batch.merge_dims(-2,-1)

for ex in token_batch.to_list():
  print(ex)

[103, 42, 3784, 59, 340, 3996, 377, 40, 1068, 94, 247, 178, 55, 1794, 247, 1124, 14, 178, 146, 1037, 5126, 14, 84, 178, 46, 1290, 700, 731, 1758, 5426, 1746, 194, 377, 44, 1608, 1124, 149, 16]
[103, 2730, 61, 340, 247, 55, 566, 1181, 5915, 268, 1553, 377, 2730, 1815, 333, 1124, 43, 9, 40, 194, 3476, 6750, 5527, 149, 194, 681, 377, 2627, 94, 16]
[56, 2056, 2538, 48, 3795, 180, 3083, 44, 208, 132, 599, 1602, 94, 14, 1944, 44, 6548, 132, 731, 897, 1944, 178, 45, 3208, 268, 83, 42, 4476, 441, 2683, 417, 14, 48, 3795, 53, 268, 45, 1061, 681, 3815, 948, 3083, 55, 496, 3235, 681, 1251, 1167, 377, 14, 44, 377, 53, 268, 1746, 194, 810, 3083, 55, 496, 44, 849, 731, 194, 3795, 647, 2730, 52, 194, 681, 1068, 16]
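
The effect of merging the `word` and `word-piece` axes can be pictured with plain nested lists. This is only an illustration of what `merge_dims(-2, -1)` does to the shape, not TensorFlow code; `merge_word_axes` is a hypothetical helper, and the IDs are the first few from the output above.

```python
def merge_word_axes(batch):
    """Flatten (batch, word, word-piece) nested lists to (batch, tokens)."""
    return [
        [piece_id for word in sentence for piece_id in word]
        for sentence in batch
    ]


# Two sentences: the first has two words, the second has one word.
nested = [
    [[103], [42, 3784]],
    [[56, 2056, 2538]],
]
flat = merge_word_axes(nested)
print(flat)
```

After the merge, word boundaries are no longer visible in the shape. They can still be recovered from the `##` prefixes of the vocabulary entries.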


In [None]:
# Lookup each token id in the vocabulary.
txt_tokens = tf.gather(pt_vocab, token_batch)
# Join with spaces.
tf.strings.reduce_join(txt_tokens, separator=' ', axis=-1)

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'mais c ##ela t ##ra ##hi ##t a ##us ##s ##i la p ##an ##i ##que , la ter ##re ##ur , que la g ##ros ##so ##p ##ho ##bie pe ##u ##t e ##vo ##que ##r .',
       b"mais le v ##ra ##i p ##ro ##b ##lem ##e es ##t le ma ##n ##que d ' a ##u ##tres infra ##st ##r ##u ##c ##t ##ure ##s .",
       b'q ##ua ##nd i ##ls so ##nt e ##ndo ##m ##ma ##ge ##s , par e ##xe ##m ##p ##le par la f ##ume ##e de c ##ig ##ar ##et ##te , i ##ls n ##e f ##on ##c ##tion ##ne ##nt p ##as corre ##c ##tem ##en ##t , e ##t n ##e pe ##u ##ve ##nt p ##as e ##x ##p ##u ##ls ##er le m ##u ##c ##us .'],
      dtype=object)>

To re-assemble words from the extracted tokens, use the `BertTokenizer.detokenize` method:

In [None]:
words = pt_tokenizer.detokenize(token_batch)
tf.strings.reduce_join(words, separator=' ', axis=-1)

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'mais cela trahit aussi la panique , la terreur , que la grossophobie peut evoquer .',
       b"mais le vrai probleme est le manque d ' autres infrastructures .",
       b'quand ils sont endommages , par exemple par la fumee de cigarette , ils ne fonctionnent pas correctement , et ne peuvent pas expulser le mucus .'],
      dtype=object)>
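
Conceptually, detokenization glues each `##` piece onto the word before it. The following plain-Python sketch shows the idea on a list of token strings; `join_wordpieces` is a hypothetical helper, and the library method works on token IDs rather than strings.

```python
def join_wordpieces(pieces, prefix="##"):
    """Re-assemble words from word pieces marked with a continuation prefix."""
    words = []
    for piece in pieces:
        if piece.startswith(prefix) and words:
            # Continuation piece: append it to the current word.
            words[-1] += piece[len(prefix):]
        else:
            # Any other piece starts a new word.
            words.append(piece)
    return words


pieces = ["mais", "c", "##ela", "t", "##ra", "##hi", "##t"]
words = join_wordpieces(pieces)
print(words)
```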

## Customization and export

This section builds the text tokenizer and detokenizer used later for Transformer training. It adds methods and processing steps to simplify the downstream translation notebook, and exports the tokenizers using `tf.saved_model` so that notebook can import them.

### Custom tokenization

The tokenized text has to include `[START]` and `[END]` tokens.

The `reserved_tokens` reserve space at the beginning of the vocabulary, so `[START]` and `[END]` have the same indexes for both languages:

In [None]:
START = tf.argmax(tf.constant(reserved_tokens) == "[START]")
END = tf.argmax(tf.constant(reserved_tokens) == "[END]")

def add_start_end(ragged):
  count = ragged.bounding_shape()[0]
  starts = tf.fill([count,1], START)
  ends = tf.fill([count,1], END)
  return tf.concat([starts, ragged, ends], axis=1)

In [None]:
words = pt_tokenizer.detokenize(add_start_end(token_batch))
tf.strings.reduce_join(words, separator=' ', axis=-1)

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'[START] mais cela trahit aussi la panique , la terreur , que la grossophobie peut evoquer . [END]',
       b"[START] mais le vrai probleme est le manque d ' autres infrastructures . [END]",
       b'[START] quand ils sont endommages , par exemple par la fumee de cigarette , ils ne fonctionnent pas correctement , et ne peuvent pas expulser le mucus . [END]'],
      dtype=object)>
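
The same wrapping logic can be written with ordinary Python lists, which makes the role of the reserved-token positions easy to see. This is a sketch for illustration only; `add_start_end_lists` is a hypothetical counterpart of the function above, and the token IDs are arbitrary.

```python
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

# The position in `reserved_tokens` is also the token ID in both vocabularies.
START = reserved_tokens.index("[START]")
END = reserved_tokens.index("[END]")


def add_start_end_lists(batch):
    """Wrap every row of token IDs in the START and END IDs."""
    return [[START] + list(row) + [END] for row in batch]


wrapped = add_start_end_lists([[103, 42], [56]])
print(wrapped)
```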

### Custom detokenization

Before exporting the tokenizers there are a couple of things you can clean up for the downstream notebooks:

1. They want to generate clean text output, so drop reserved tokens like `[START]`, `[END]` and `[PAD]`.
2. They're interested in complete strings, so apply a string join along the `words` axis of the result.  

In [None]:
def cleanup_text(reserved_tokens, token_txt):
  # Drop the reserved tokens, except for "[UNK]".
  bad_tokens = [re.escape(tok) for tok in reserved_tokens if tok != "[UNK]"]
  bad_token_re = "|".join(bad_tokens)
    
  bad_cells = tf.strings.regex_full_match(token_txt, bad_token_re)
  result = tf.ragged.boolean_mask(token_txt, ~bad_cells)

  # Join them into strings.
  result = tf.strings.reduce_join(result, separator=' ', axis=-1)

  return result
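
The `re.escape` call matters because square brackets are regex syntax: an unescaped `[END]` is a character class that matches a single `E`, `N`, or `D`. Here is a small sketch of the same filter using Python's `re` module instead of `tf.strings`; `cleanup_words` is a hypothetical helper that works on one list of words.

```python
import re

reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

# Same pattern construction as above: everything except "[UNK]" is dropped.
bad_tokens = [re.escape(tok) for tok in reserved_tokens if tok != "[UNK]"]
bad_token_re = "|".join(bad_tokens)


def cleanup_words(words):
    """Drop reserved tokens (except "[UNK]") and join the rest with spaces."""
    kept = [w for w in words if not re.fullmatch(bad_token_re, w)]
    return " ".join(kept)


cleaned = cleanup_words(["[START]", "je", "[UNK]", "[END]", "[PAD]"])
print(cleaned)

# Without escaping, "[END]" would match the one-letter word "E".
unescaped_hit = re.fullmatch("[END]", "E") is not None
escaped_hit = re.fullmatch(re.escape("[END]"), "E") is not None
print(unescaped_hit, escaped_hit)
```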

In [None]:
pt_examples.numpy()

array([b'mais cela trahit aussi la panique , la terreur , que la grossophobie peut \xc3\xa9voquer .',
       b"mais le vrai probl\xc3\xa8me est le manque d'autres infrastructures .",
       b'quand ils sont endommag\xc3\xa9s , par exemple par la fum\xc3\xa9e de cigarette , ils ne fonctionnent pas correctement , et ne peuvent pas expulser le mucus .'],
      dtype=object)

In [None]:
token_batch = pt_tokenizer.tokenize(pt_examples).merge_dims(-2,-1)
words = pt_tokenizer.detokenize(token_batch)
words

<tf.RaggedTensor [[b'mais', b'cela', b'trahit', b'aussi', b'la', b'panique', b',', b'la', b'terreur', b',', b'que', b'la', b'grossophobie', b'peut', b'evoquer', b'.'], [b'mais', b'le', b'vrai', b'probleme', b'est', b'le', b'manque', b'd', b"'", b'autres', b'infrastructures', b'.'], [b'quand', b'ils', b'sont', b'endommages', b',', b'par', b'exemple', b'par', b'la', b'fumee', b'de', b'cigarette', b',', b'ils', b'ne', b'fonctionnent', b'pas', b'correctement', b',', b'et', b'ne', b'peuvent', b'pas', b'expulser', b'le', b'mucus', b'.']]>

In [None]:
cleanup_text(reserved_tokens, words).numpy()

array([b'mais cela trahit aussi la panique , la terreur , que la grossophobie peut evoquer .',
       b"mais le vrai probleme est le manque d ' autres infrastructures .",
       b'quand ils sont endommages , par exemple par la fumee de cigarette , ils ne fonctionnent pas correctement , et ne peuvent pas expulser le mucus .'],
      dtype=object)

### Export

The following code block builds a `CustomTokenizer` class to contain the `text.BertTokenizer` instances, the custom logic, and the `@tf.function` wrappers required for export. 

In [None]:
class CustomTokenizer(tf.Module):
  def __init__(self, reserved_tokens, vocab_path):
    self.tokenizer = text.BertTokenizer(vocab_path, lower_case=True)
    self._reserved_tokens = reserved_tokens
    self._vocab_path = tf.saved_model.Asset(vocab_path)

    vocab = pathlib.Path(vocab_path).read_text(encoding='utf-8').splitlines()
    self.vocab = tf.Variable(vocab)

    ## Create the signatures for export:   

    # Include a tokenize signature for a batch of strings. 
    self.tokenize.get_concrete_function(
        tf.TensorSpec(shape=[None], dtype=tf.string))
    
    # Include `detokenize` and `lookup` signatures for:
    #   * `Tensors` with shape [batch, tokens]
    #   * `RaggedTensors` with shape [batch, tokens]
    self.detokenize.get_concrete_function(
        tf.TensorSpec(shape=[None, None], dtype=tf.int64))
    self.detokenize.get_concrete_function(
        tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64))

    self.lookup.get_concrete_function(
        tf.TensorSpec(shape=[None, None], dtype=tf.int64))
    self.lookup.get_concrete_function(
        tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64))

    # These `get_*` methods take no arguments
    self.get_vocab_size.get_concrete_function()
    self.get_vocab_path.get_concrete_function()
    self.get_reserved_tokens.get_concrete_function()
    
  @tf.function
  def tokenize(self, strings):
    enc = self.tokenizer.tokenize(strings)
    # Merge the `word` and `word-piece` axes.
    enc = enc.merge_dims(-2,-1)
    enc = add_start_end(enc)
    return enc

  @tf.function
  def detokenize(self, tokenized):
    words = self.tokenizer.detokenize(tokenized)
    return cleanup_text(self._reserved_tokens, words)

  @tf.function
  def lookup(self, token_ids):
    return tf.gather(self.vocab, token_ids)

  @tf.function
  def get_vocab_size(self):
    return tf.shape(self.vocab)[0]

  @tf.function
  def get_vocab_path(self):
    return self._vocab_path

  @tf.function
  def get_reserved_tokens(self):
    return tf.constant(self._reserved_tokens)

Build a `CustomTokenizer` for each language:

In [None]:
tokenizers = tf.Module()
tokenizers.pt = CustomTokenizer(reserved_tokens, 'pt_vocab.txt')
tokenizers.fr = CustomTokenizer(reserved_tokens, 'fr_vocab.txt')

Export the tokenizers as a `saved_model`:

In [None]:
model_name = 'ted_hrlr_translate_fr_pt_converter'
tf.saved_model.save(tokenizers, model_name)

Reload the `saved_model` and test the methods:

In [None]:
reloaded_tokenizers = tf.saved_model.load(model_name)
reloaded_tokenizers.fr.get_vocab_size().numpy()

7139

In [None]:
tokens = reloaded_tokenizers.fr.tokenize(["Je m'appelle Vera"])
tokens.numpy()

array([[   2,  110,   50,    9,  303, 5147,  220,    3]])

In [None]:
text_tokens = reloaded_tokenizers.fr.lookup(tokens)
text_tokens

<tf.RaggedTensor [[b'[START]', b'je', b'm', b"'", b'appelle', b'ver', b'##a', b'[END]']]>

In [None]:
round_trip = reloaded_tokenizers.fr.detokenize(tokens)

print(round_trip.numpy()[0].decode('utf-8'))

je m ' appelle vera


Archive it for the translation notebook:

In [None]:
!zip -r {model_name}.zip {model_name}

  adding: ted_hrlr_translate_fr_pt_converter/ (stored 0%)
  adding: ted_hrlr_translate_fr_pt_converter/variables/ (stored 0%)
  adding: ted_hrlr_translate_fr_pt_converter/variables/variables.index (deflated 33%)
  adding: ted_hrlr_translate_fr_pt_converter/variables/variables.data-00000-of-00001 (deflated 52%)
  adding: ted_hrlr_translate_fr_pt_converter/saved_model.pb (deflated 91%)
  adding: ted_hrlr_translate_fr_pt_converter/assets/ (stored 0%)
  adding: ted_hrlr_translate_fr_pt_converter/assets/fr_vocab.txt (deflated 57%)
  adding: ted_hrlr_translate_fr_pt_converter/assets/pt_vocab.txt (deflated 57%)


In [None]:
!du -h *.zip

164K	ted_hrlr_translate_fr_pt_converter.zip
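
Where the `zip` command is not available, the standard library can build an equivalent archive. The sketch below runs on a stand-in directory inside a temporary folder; `toy_model` and its contents are placeholders, not the exported model.

```python
import pathlib
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    root = pathlib.Path(tmp_dir)

    # Build a small stand-in for the exported model directory.
    model_dir = root / "toy_model"
    (model_dir / "assets").mkdir(parents=True)
    (model_dir / "assets" / "toy_vocab.txt").write_text(
        "[PAD]\n[UNK]\n", encoding="utf-8")

    # Create toy_model.zip next to the directory, keeping the top-level folder.
    archive_path = shutil.make_archive(
        str(root / "toy_model"), "zip",
        root_dir=str(root), base_dir="toy_model")

    # List the archive contents to confirm the layout.
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = sorted(zf.namelist())

print(archived_names)
```

For the real export, the same call would take the `model_name` directory as `base_dir` and the current working directory as `root_dir`.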
