# Tokenization Tutorial

Many NLP methods, such as machine translation and word alignment, require tokenized data as input. In this notebook, we will show how to use the tokenizers and detokenizers that are available in Machine. Tokenizers implement either the `ITokenizer` interface or the `IRangeTokenizer` interface. `ITokenizer` classes segment a sequence into tokens. `IRangeTokenizer` classes can additionally return ranges that mark where each token occurs in the sequence. Detokenizers implement the `IDetokenizer` interface.


In [1]:
#r "nuget:SIL.Scripture,12.0.1"
#r "../src/SIL.Machine/bin/Debug/netstandard2.0/SIL.Machine.dll"
#r "../src/SIL.Machine.Tokenization.SentencePiece/bin/Debug/netstandard2.0/SIL.Machine.Tokenization.SentencePiece.dll"

void WriteLine(string text = "")
{
    Console.Write(text + "\n");
}

## Tokenizing text

Let's start with a simple whitespace tokenizer, which splits a string at whitespace. It is useful for text that has already been tokenized.


In [2]:
using SIL.Machine.Tokenization;

var tokenizer = new WhitespaceTokenizer();
var tokens = tokenizer.Tokenize("This is a test .");
WriteLine(string.Join(" | ", tokens));

This | is | a | test | .


Machine contains general tokenizers that can be used to tokenize text from languages with a Latin-based script. A word tokenizer and a sentence tokenizer are available.


In [3]:
var sentenceTokenizer = new LatinSentenceTokenizer();
var sentences = sentenceTokenizer.Tokenize(
    "Integer scelerisque efficitur dui, eu tincidunt erat posuere in. Curabitur vel finibus mi.");
var wordTokenizer = new LatinWordTokenizer();
WriteLine(string.Join("\n", sentences.Select(s => string.Join(" | ", wordTokenizer.Tokenize(s)))));

Integer | scelerisque | efficitur | dui | , | eu | tincidunt | erat | posuere | in | .
Curabitur | vel | finibus | mi | .


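Most tokenizers implement the `IRangeTokenizer` interface. These tokenizers have an additional method, `TokenizeAsRanges`, that returns ranges marking the position of each token in the original string. The following cell uses those ranges to wrap every token in brackets while leaving the rest of the string untouched.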


In [4]:
var wordTokenizer = new LatinWordTokenizer();
var sentence = "\"This is a test, also.\"";
var ranges = wordTokenizer.TokenizeAsRanges(sentence);
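var output = "";
var prevEnd = 0;
foreach (var range in ranges)
{
    // Copy the text between the previous token and this one unchanged.
    output += sentence.Substring(prevEnd, range.Start - prevEnd);
    // Wrap the token itself in brackets.
    output += $"[{sentence.Substring(range.Start, range.Length)}]";
    prevEnd = range.End;
}
// Append whatever follows the last token.
WriteLine(output + sentence.Substring(prevEnd));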

["][This] [is] [a] [test][,] [also][.]["]
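
To make the idea of token ranges concrete, here is a minimal sketch in plain C# that computes half-open `(Start, End)` offsets for whitespace-delimited tokens. The helper `WhitespaceRanges` is hypothetical and is not part of Machine. It only illustrates the kind of information a range tokenizer returns, not how Machine's tokenizers are implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Machine): yields the (Start, End) offsets
// of each whitespace-delimited token, where End is exclusive.
IEnumerable<(int Start, int End)> WhitespaceRanges(string input)
{
    int tokenStart = -1; // -1 means "not currently inside a token"
    for (int i = 0; i < input.Length; i++)
    {
        if (char.IsWhiteSpace(input[i]))
        {
            if (tokenStart >= 0)
            {
                // Whitespace closes the token that is currently open.
                yield return (tokenStart, i);
                tokenStart = -1;
            }
        }
        else if (tokenStart < 0)
        {
            // First non-whitespace character opens a new token.
            tokenStart = i;
        }
    }
    // Close a token that runs to the end of the string.
    if (tokenStart >= 0)
        yield return (tokenStart, input.Length);
}

var sample = "This is  a test .";
foreach (var (start, end) in WhitespaceRanges(sample))
    Console.WriteLine($"[{start}, {end}) -> \"{sample.Substring(start, end - start)}\"");
```

Because the offsets point back into the original string, the text between tokens (here, a run of two spaces) is never lost, which is what makes range-based output useful for annotating or highlighting the source text.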


Some languages do not delimit words with spaces but instead use spaces to delimit sentences. In these cases, it is common practice to use zero-width spaces to explicitly mark word boundaries. This is often done for Bible translations. Machine contains a word tokenizer that is designed to handle text that uses zero-width spaces to delimit words and spaces to delimit sentences. Notice that the space is preserved as a token, since it is being used as punctuation to delimit sentences.


In [5]:
var wordTokenizer = new ZwspWordTokenizer();
var tokens = wordTokenizer.Tokenize("Lorem​Ipsum​Dolor​Sit​Amet​Consectetur Adipiscing​Elit​Sed");
WriteLine(string.Join(" | ", tokens));

Lorem | Ipsum | Dolor | Sit | Amet | Consectetur |   | Adipiscing | Elit | Sed
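
The splitting rule described above can be sketched in plain C#. The helper `SplitOnZwsp` below is hypothetical and is not Machine's `ZwspWordTokenizer`. It assumes the simplest possible rule: a zero-width space (U+200B) ends a word and is discarded, while an ordinary space ends a word and is kept as its own token. The real tokenizer may handle punctuation and other cases that this sketch ignores.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of Machine): splits on U+200B and emits each
// ordinary space as a separate token.
List<string> SplitOnZwsp(string input)
{
    var tokens = new List<string>();
    var current = new StringBuilder();
    foreach (char c in input)
    {
        if (c == '\u200B' || c == ' ')
        {
            // Both delimiters end the word collected so far.
            if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
            // Only the ordinary space survives as a token.
            if (c == ' ')
                tokens.Add(" ");
        }
        else
        {
            current.Append(c);
        }
    }
    if (current.Length > 0)
        tokens.Add(current.ToString());
    return tokens;
}

// \u200B escapes are used so that the invisible delimiters are easy to see.
var sample = "Lorem\u200BIpsum Dolor\u200BSit";
Console.WriteLine(string.Join(" | ", SplitOnZwsp(sample)));
```

Writing the delimiter as the escape `\u200B` rather than pasting the invisible character keeps the example readable and avoids the delimiter being lost when the code is copied.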


Subword tokenization has become popular for use with deep learning models. It is language-independent and allows one to specify the size of the vocabulary, which helps to deal with out-of-vocabulary issues. Machine provides a [SentencePiece](https://github.com/google/sentencepiece) tokenizer that can perform both BPE and unigram subword tokenization. SentencePiece classes are implemented in the [SIL.Machine.Tokenization.SentencePiece](https://www.nuget.org/packages/SIL.Machine.Tokenization.SentencePiece/) package. First, let's train a SentencePiece model.


In [6]:
using System.IO;
using SIL.Machine.Tokenization.SentencePiece;

Directory.CreateDirectory("out");
var trainer = new SentencePieceTrainer
{
    VocabSize = 400,
    ModelType = SentencePieceModelType.Unigram
};
trainer.Train("data/en.txt", "out/en-sp");

Now that we have a SentencePiece model, we can split the text into subwords.


In [7]:
{
    using var tokenizer = new SentencePieceTokenizer("out/en-sp.model");
    var tokens = tokenizer.Tokenize("This is a test.");
    WriteLine(string.Join(" | ", tokens));
}

▁Th | is | ▁ | is | ▁a | ▁ | t | es | t | .


## Detokenizing text

In many NLP pipelines, tokens need to be merged back into detokenized text. This is very common for machine translation. Many of the tokenizers in Machine have a corresponding detokenizer that can be used to convert tokens back into correctly formatted text. Once again, let's start with a simple whitespace detokenizer.


In [8]:
var detokenizer = new WhitespaceDetokenizer();
var sentence = detokenizer.Detokenize(new[] { "This", "is", "a", "test", "." });
WriteLine(sentence);

This is a test .


Machine has a general detokenizer that works well with languages with a Latin-based script.


In [9]:
var wordDetokenizer = new LatinWordDetokenizer();
var sentence = wordDetokenizer.Detokenize(new[] { "\"", "This", "is", "a", "test", ",", "also", ".", "\"" });
WriteLine(sentence);

"This is a test, also."


Machine has a detokenizer that properly deals with text that uses zero-width spaces to delimit words and spaces to delimit sentences.


In [10]:
var wordDetokenizer = new ZwspWordDetokenizer();
var sentence = wordDetokenizer.Detokenize(
    new[] { "Lorem", "Ipsum", "Dolor", "Sit", "Amet", "Consectetur", " ", "Adipiscing", "Elit", "Sed" });
WriteLine(sentence);

Lorem​Ipsum​Dolor​Sit​Amet​Consectetur Adipiscing​Elit​Sed


Machine contains a detokenizer for SentencePiece-encoded text. SentencePiece encodes spaces in the tokens themselves, using the `▁` character, so that the text can be detokenized without any ambiguity.


In [11]:
var detokenizer = new SentencePieceDetokenizer();
var sentence = detokenizer.Detokenize(new[] { "▁Th", "is", "▁", "is", "▁a", "▁", "t", "es", "t", "." });
WriteLine(sentence);

This is a test.
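
To see why the encoded spaces remove ambiguity, the following plain C# sketch rebuilds the sentence by hand. The helper `JoinPieces` is hypothetical and is not Machine's `SentencePieceDetokenizer`. It assumes the common SentencePiece convention in which `▁` (U+2581) stands for a space, so the pieces can simply be concatenated and the markers replaced.

```csharp
using System;

// Hypothetical helper (not part of Machine): joins SentencePiece-style pieces
// under the assumption that "▁" (U+2581) marks a space.
string JoinPieces(string[] pieces)
{
    // Concatenate without separators, then turn every marker back into a space.
    string joined = string.Concat(pieces).Replace('\u2581', ' ');
    // The marker on the first piece produces a leading space; remove it.
    return joined.TrimStart();
}

var pieces = new[] { "▁Th", "is", "▁", "is", "▁a", "▁", "t", "es", "t", "." };
Console.WriteLine(JoinPieces(pieces));
```

No language-specific rules about punctuation or quotation marks are needed here, because every space in the output comes from a marker in the input. This contrasts with detokenizers for ordinary word tokens, which must decide where spaces belong.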
