# Tokenization Tutorial

Many NLP methods, such as machine translation and word alignment, require tokenized data as input. In this notebook, we will show how to use the different tokenizers and detokenizers that are available in Machine. Tokenizers implement either the `Tokenizer` abstract class or the `RangeTokenizer` abstract class. `Tokenizer` classes segment a sequence into tokens. `RangeTokenizer` classes return ranges that mark where each token occurs in the sequence. Detokenizers implement the `Detokenizer` abstract class.


## Tokenizing text

Let's start with a simple whitespace tokenizer. It splits a string at whitespace, which makes it useful for text that has already been tokenized.


In [11]:
from machine.tokenization import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize("This is a test .")
print(" | ".join(tokens))

This | is | a | test | .


Machine contains general tokenizers that can be used to tokenize text from languages with a Latin-based script. A word tokenizer and a sentence tokenizer are available.


In [12]:
from machine.tokenization import LatinSentenceTokenizer, LatinWordTokenizer

sentence_tokenizer = LatinSentenceTokenizer()
sentences = sentence_tokenizer.tokenize(
    "Integer scelerisque efficitur dui, eu tincidunt erat posuere in. Curabitur vel finibus mi."
)
word_tokenizer = LatinWordTokenizer()
print("\n".join(" | ".join(word_tokenizer.tokenize(sentence)) for sentence in sentences))

Integer | scelerisque | efficitur | dui | , | eu | tincidunt | erat | posuere | in | .
Curabitur | vel | finibus | mi | .


Most tokenizers implement the `RangeTokenizer` interface. These tokenizers have an additional method, `tokenize_as_ranges`, that returns ranges that mark the position of all tokens in the original string.

In [13]:
word_tokenizer = LatinWordTokenizer()
sentence = '"This is a test, also."'
ranges = word_tokenizer.tokenize_as_ranges(sentence)
output = ""
prev_end = 0
for token_range in ranges:  # avoid shadowing the built-in `range`
    output += sentence[prev_end : token_range.start]
    output += f"[{sentence[token_range.start : token_range.end]}]"
    prev_end = token_range.end
print(output + sentence[prev_end:])

["][This] [is] [a] [test][,] [also][.]["]
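To see what a range tokenizer returns without depending on Machine, here is a minimal standalone sketch. It is not Machine's implementation: `Span`, `whitespace_ranges`, and `tokens_from_ranges` are hypothetical names introduced for illustration, and the sketch splits on whitespace only. It shows that ranges are offsets into the original string, so the tokens can always be recovered by slicing.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical half-open character range [start, end) into a string."""

    start: int
    end: int


def whitespace_ranges(text: str) -> List[Span]:
    # Each run of non-whitespace characters becomes one range.
    return [Span(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def tokens_from_ranges(text: str, ranges: List[Span]) -> List[str]:
    # Slicing the original string with each range recovers the token text.
    return [text[r.start : r.end] for r in ranges]


text = "This  is a test ."  # note the double space after "This"
ranges = whitespace_ranges(text)
tokens = tokens_from_ranges(text, ranges)
print(ranges)
print(" | ".join(tokens))
```

Because the ranges keep the original offsets, the text between tokens (here the double space) is never lost. The loop above relies on the same property when it rebuilds the sentence around the bracketed tokens.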


Some languages do not delimit words with spaces, but instead delimit sentences with spaces. In these cases, it is common practice to use zero-width spaces to explicitly mark word boundaries. This is often done for Bible translations. Machine contains a word tokenizer that is designed to handle text that uses zero-width spaces to delimit words and spaces to delimit sentences. Notice that the space is preserved as a token, since it is being used as punctuation to delimit sentences.


In [14]:
from machine.tokenization import ZwspWordTokenizer

word_tokenizer = ZwspWordTokenizer()
tokens = word_tokenizer.tokenize("Lorem​Ipsum​Dolor​Sit​Amet​Consectetur Adipiscing​Elit​Sed")
print(" | ".join(tokens))

Lorem | Ipsum | Dolor | Sit | Amet | Consectetur |   | Adipiscing | Elit | Sed
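Zero-width spaces are invisible when printed, so the input string above looks like it has no word boundaries at all. The following standalone sketch writes the character explicitly as `\u200b` to make the boundaries visible. It is only an approximation under simple assumptions, not Machine's implementation: `zwsp_tokenize` is a hypothetical helper that ignores punctuation handling and treats every run of ordinary spaces as a token.

```python
import re
from typing import List

ZWSP = "\u200b"  # zero-width space, invisible when printed


def zwsp_tokenize(text: str) -> List[str]:
    # Split on ZWSP. The capturing group keeps each run of ordinary
    # spaces in the result as a token of its own.
    parts = re.split(f"{ZWSP}|( +)", text)
    # re.split yields None for the unmatched group and may yield empty
    # strings between adjacent delimiters; drop both.
    return [p for p in parts if p]


sample = f"Lorem{ZWSP}Ipsum Dolor{ZWSP}Sit"
sample_tokens = zwsp_tokenize(sample)
print(" | ".join(sample_tokens))
```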


Subword tokenization has become popular for use with deep learning models. Machine provides a [SentencePiece](https://github.com/google/sentencepiece) tokenizer that can perform both BPE and unigram subword tokenization. Subword tokenization is language-independent and allows one to specify the size of the vocabulary, which helps to deal with out-of-vocabulary issues. First, let's train a SentencePiece model. SentencePiece classes require the `sentencepiece` optional dependency.


In [15]:
import os
import sentencepiece as sp

os.makedirs("out", exist_ok=True)
sp.SentencePieceTrainer.Train("--input=data/en.txt --model_prefix=out/en-sp --vocab_size=400")

Now that we have a SentencePiece model, we can split the text into subwords.


In [16]:
from machine.tokenization.sentencepiece import SentencePieceTokenizer

tokenizer = SentencePieceTokenizer("out/en-sp.model")
tokens = tokenizer.tokenize("This is a test.")
print(" | ".join(tokens))

▁Th | is | ▁ | is | ▁a | ▁ | t | es | t | .


## Detokenizing text

For many NLP pipelines, tokens will need to be merged back into detokenized text. This is very common for machine translation. Many of the tokenizers in Machine also have a corresponding detokenizer that can be used to convert tokens back into correctly formatted text. Once again, let's start with a simple whitespace detokenizer.


In [17]:
from machine.tokenization import WhitespaceDetokenizer

detokenizer = WhitespaceDetokenizer()
sentence = detokenizer.detokenize(["This", "is", "a", "test", "."])
print(sentence)

This is a test .


Machine has a general detokenizer that works well with languages with a Latin-based script.


In [18]:
from machine.tokenization import LatinWordDetokenizer

detokenizer = LatinWordDetokenizer()
sentence = detokenizer.detokenize(['"', "This", "is", "a", "test", ",", "also", ".", '"'])
print(sentence)

"This is a test, also."

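To give a feel for the kind of rules such a detokenizer needs, here is a small rule-based sketch. It is not how `LatinWordDetokenizer` is implemented: `simple_detokenize` and its two punctuation sets are hypothetical and cover only a handful of cases (common closing punctuation, opening brackets, and alternating straight double quotes).

```python
from typing import List

ATTACH_LEFT = {".", ",", ";", ":", "!", "?", ")", "]"}  # no space before these
ATTACH_RIGHT = {"(", "["}  # no space after these


def simple_detokenize(tokens: List[str]) -> str:
    out = ""
    quote_open = False  # are we inside a pair of straight double quotes?
    glue_next = True  # True when the next token attaches without a space
    for tok in tokens:
        if tok == '"':
            if quote_open:
                out += tok  # a closing quote attaches to the left
                glue_next = False
            else:
                out += tok if glue_next else " " + tok  # opening quote
                glue_next = True  # the following token attaches to it
            quote_open = not quote_open
        elif tok in ATTACH_LEFT:
            out += tok
            glue_next = False
        else:
            out += tok if glue_next else " " + tok
            glue_next = tok in ATTACH_RIGHT
    return out


result = simple_detokenize(['"', "This", "is", "a", "test", ",", "also", ".", '"'])
print(result)
```

Even this small example needs state (whether a quote is currently open), which is why detokenization is usually handled by a dedicated class rather than a plain `" ".join(tokens)`.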

Machine has a detokenizer that properly deals with text that uses zero-width space to delimit words and spaces to delimit sentences.


In [19]:
from machine.tokenization import ZwspWordDetokenizer

word_detokenizer = ZwspWordDetokenizer()
sentence = word_detokenizer.detokenize(
    ["Lorem", "Ipsum", "Dolor", "Sit", "Amet", "Consectetur", " ", "Adipiscing", "Elit", "Sed"]
)
print(sentence)

Lorem​Ipsum​Dolor​Sit​Amet​Consectetur Adipiscing​Elit​Sed


Machine contains a detokenizer for SentencePiece-encoded text. SentencePiece encodes spaces in the tokens themselves, so the text can be detokenized without ambiguity.


In [20]:
from machine.tokenization.sentencepiece import SentencePieceDetokenizer

detokenizer = SentencePieceDetokenizer()
sentence = detokenizer.detokenize(["▁Th", "is", "▁", "is", "▁a", "▁", "t", "es", "t", "."])
print(sentence)

This is a test.
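The reason this is unambiguous can be shown in a few lines of plain Python. SentencePiece marks each original space with the character `▁` (U+2581), so joining the pieces and turning the markers back into spaces recovers the text. The sketch below is a simplified illustration, not Machine's implementation, and `join_pieces` is a hypothetical helper name.

```python
from typing import List

SPACE_MARKER = "\u2581"  # the "▁" character that SentencePiece uses for spaces


def join_pieces(pieces: List[str]) -> str:
    # Concatenate the pieces, turn each marker back into a space, and
    # drop the leading space introduced by the first marker.
    return "".join(pieces).replace(SPACE_MARKER, " ").lstrip(" ")


pieces = ["\u2581Th", "is", "\u2581", "is", "\u2581a", "\u2581", "t", "es", "t", "."]
restored = join_pieces(pieces)
print(restored)
```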
