<a href="https://colab.research.google.com/github/10AcademyG4/data-warehouse-for-llm-finetuning/blob/hugging_face/sentencepiece_python_module_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import warnings
warnings.filterwarnings("ignore")

# Sentencepiece python module


This notebook describes comprehensive examples of sentencepiece Python module.
Since Python module calls C++ API through SWIG,  this document is also useful for developing C++ client.

## Install and data preparation

We use the small training data (botchan.txt) in this example.
([Botchan](https://en.wikipedia.org/wiki/Botchan) is a novel written by Natsume Sōseki in 1906.  The sample is English-translated one.)

## Basic  end-to-end example



In [2]:
# !pip install sentencepiece
# !wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt

In [3]:
import os

In [4]:
import sentencepiece as spm

# train sentencepiece model from `amharic.txt` and makes `m.model` and `m.vocab`
# `m.vocab` is just a reference. not used in the segmentation.
# Print the current working directory
print("Current working directory:", os.getcwd())

# Check if the file exists in the current working directory
if os.path.exists('amharic.txt'):
    spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m --vocab_size=238')
else:
    print("The file 'amharic.txt' does not exist in the current working directory.")

# makes segmenter instance and loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# English: "In Kenya, for the first time, prisoners voted in the election."

# encode: text => id
print(sp.encode_as_pieces('በኬንያ በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))
print(sp.encode_as_ids('በኬንያ  በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))

Current working directory: /home/hilla/code/10Academy-training/week4/llm_datawarehouse/notebooks/tokenization
['▁በ', 'ኬንያ', '▁በ', 'ኬንያ', '▁የ', 'ም', 'ር', 'ጫ', '▁', 'ታ', 'ሪ', 'ክ', '▁ለመ', 'ጀ', 'መ', 'ሪ', 'ያ', '▁', 'ጊ', 'ዜ', '▁የ', 'ህ', 'ግ', '▁', 'ታ', 'ራ', 'ሚ', 'ዎ', 'ች', '▁', 'ድ', 'ም', 'ጽ', '▁ሰ', 'ጡ', '፡፡']
[7, 126, 7, 126, 16, 6, 8, 226, 3, 17, 19, 56, 149, 128, 59, 19, 21, 3, 203, 196, 16, 73, 166, 3, 17, 13, 176, 194, 43, 3, 42, 6, 0, 113, 131, 0]


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m --vocab_size=238
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 238
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece

###### Explanation of the output above:
The output you're seeing is related to the training of a SentencePiece model on your `amharic.txt` file.

1. `Current working directory: /home/hilla/code/10Academy-training/week4/llm_datawarehouse/notebooks/tokenization`: This is the directory where your Python script is running.

2. `['▁በ', 'ኬንያ', '▁በ', 'ኬንያ', '▁የ', 'ም', 'ር', 'ጫ', '▁', 'ታ', 'ሪ', 'ክ', '▁ለመ', 'ጀ', 'መ', 'ሪ', 'ያ', '▁', 'ጊ', 'ዜ', '▁የ', 'ህ', 'ግ', '▁', 'ታ', 'ራ', 'ሚ', 'ዎ', 'ች', '▁', 'ድ', 'ም', 'ጽ', '▁ሰ', 'ጡ', '፡፡']`: This is a list of tokens generated by the SentencePiece model from a sentence in your `amharic.txt` file.

3. `[7, 126, 7, 126, 16, 6, 8, 226, 3, 17, 19, 56, 149, 128, 59, 19, 21, 3, 203, 196, 16, 73, 166, 3, 17, 13, 176, 194, 43, 3, 42, 6, 0, 113, 131, 0]`: This is a list of the corresponding IDs for each token in the previous list.

4. The rest of the output is logging information from the SentencePiece model training process. It includes details about the training configuration (e.g., `vocab_size: 238`, `model_type: UNIGRAM`, `character_coverage: 0.9995`, etc.) and the progress of the training (e.g., `EM sub_iter=0 size=225 obj=16.523 num_tokens=709 num_tokens/piece=3.15111`).

5. `trainer_interface.cc(687) LOG(INFO) Saving model: m.model` and `trainer_interface.cc(699) LOG(INFO) Saving vocabs: m.vocab`: These lines indicate that the trained model and vocabulary are being saved to the files `m.model` and `m.vocab`, respectively.

the amharic.txt file was used to train the SentencePiece model and generate the m.model and m.vocab files. The model learned to tokenize text and assign IDs to tokens based on the text in the amharic.txt file.

If you want to tokenize a new text and get the corresponding IDs, you should use the same SentencePiece model. The new text doesn't have to be in the amharic.txt file, but it should be in the same language and use the same script (in this case, Amharic and the Ge'ez script).

In [12]:
def file_to_ids(filename):
  # Read the text from the file
  with open(filename, 'r') as file:
    text = file.read()

  # Tokenize the text and get the IDs of the tokens
  ids = sp.encode_as_ids(text)
  return ids

# Use the function to get the IDs of a text in a file
ids = file_to_ids('new_amharic.txt')
print("IDs:", ids)

IDs: [109, 16, 223, 8, 163, 190, 183, 7, 28, 38, 14, 8, 6, 201, 114, 60, 30, 42, 46, 37, 26, 6, 0, 20]


In [13]:
# decode: id => text
pieces = ['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ', '▁ለ', 'መጀመሪያ', '▁ጊዜ', '▁የህግ', '▁ታ', 'ራ', 'ሚ', 'ዎች', '▁ድምጽ', '▁', 'ሰጡ', '፡፡']
print(sp.decode_pieces(pieces))
print(sp.decode_ids(ids))

በኬንያ▁የምርጫ▁ታሪክ ለመጀመሪያ▁ጊዜ▁የህግ▁ታራሚዎች▁ድምጽ ሰጡ፡፡
የእስራኤል የጦር ካቢኔ በኢራን ላይ እርምጃ እንዲወሰድ ተስማም ⁇ ል


Output Explanation:

*sp.decode_pieces(pieces)*: This function takes a list of tokens as input and returns a string where each token has been replaced with its corresponding text. The tokens are pieces of text that the SentencePiece model has learned to recognize during training. The underscore character ('▁') at the beginning of some tokens represents a space in the original text.

*p.decode_ids(ids)*: This function takes a list of IDs as input and returns a string where each ID has been replaced with its corresponding text. The IDs are unique identifiers that the SentencePiece model has assigned to each unique token in its vocabulary.

##### Encoding and Decoding
The process of encoding and decoding is a fundamental part of working with natural language processing (NLP) models.

Encoding: This is the process of converting raw text into a format (like tokens or IDs) that can be understood by a machine learning model. In this case, the SentencePiece model is used to break down the text into smaller pieces (tokens) and assign each unique token a unique ID. This is necessary because machine learning models don't understand raw text, they work with numerical data.

Decoding: This is the reverse process of encoding. It converts the model's output from the numerical format back into human-readable text. In this case, the SentencePiece model is used to convert a list of tokens or IDs back into text. This is necessary to interpret the model's output in a meaningful way.



In [23]:
# returns vocab size, which is the total number of unique tokens that the model has learned to recognize.
print(sp.get_piece_size())

# id <=> piece conversion

# takes an ID as input and returns the corresponding token.
print(sp.id_to_piece(7))

# takes a token as input and returns the corresponding ID.
print(sp.piece_to_id('▁በ'))

# returns 0 for unknown tokens (we can change the id for UNK)
'''
This function below tries to find the ID that corresponds to the token 'MUST_BE_UNKNOWN'. 
If the token is not in the model's vocabulary, the function returns 0, which is the ID for unknown tokens (<unk>).
'''
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))

# returns 126 for 'ኬንያ', we can use this to check if a token is in the vocabulary, we defined this before
print(sp.piece_to_id('ኬንያ'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.

'''
 For each ID, the loop prints the corresponding token and a boolean value that indicates whether the token is a control symbol. 
 Control symbols are special tokens that are used to control the behavior of the model, but they don't represent any actual text.
'''
for id in range(3):
  print(sp.id_to_piece(id), sp.is_control(id))

238
▁በ
7
0
126
<unk> False
<s> True
</s> True


## Loads model from byte stream

Sentencepiece's model file is just a serialized [protocol buffer](https://developers.google.com/protocol-buffers/). We can instantiate sentencepiece processor from byte object with **load_from_serialized_proto** method.

In [None]:
# !pip install tensorflow

In [26]:
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m --vocab_size=238')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())  # disabled by default


print(sp.encode_as_ids('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))

# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡') + [sp.eos_id()])

bos= 1
eos= 2
unk= 0
pad= -1
[7, 126, 16, 6, 8, 226, 3, 17, 19, 56, 149, 128, 59, 19, 21, 3, 203, 196, 16, 73, 166, 3, 17, 13, 176, 194, 43, 3, 42, 6, 0, 113, 131, 0]
[1, 7, 126, 16, 6, 8, 226, 3, 17, 19, 56, 149, 128, 59, 19, 21, 3, 203, 196, 16, 73, 166, 3, 17, 13, 176, 194, 43, 3, 42, 6, 0, 113, 131, 0, 2]


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m --vocab_size=238
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 238
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece

In [27]:
import tensorflow as tf

# Assumes that m.model is stored in non-Posix file system.
serialized_model_proto = tf.io.gfile.GFile('m.model', 'rb').read()

sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)

print(sp.encode_as_pieces('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡ '))

2024-05-30 09:12:53.300954: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-30 09:12:53.309954: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-30 09:12:53.373441: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


['▁በ', 'ኬንያ', '▁የ', 'ም', 'ር', 'ጫ', '▁', 'ታ', 'ሪ', 'ክ', '▁ለመ', 'ጀ', 'መ', 'ሪ', 'ያ', '▁', 'ጊ', 'ዜ', '▁የ', 'ህ', 'ግ', '▁', 'ታ', 'ራ', 'ሚ', 'ዎ', 'ች', '▁', 'ድ', 'ም', 'ጽ', '▁ሰ', 'ጡ', '፡፡']


##### Explanation of the code and output above:

This code is using the SentencePiece library to load a pre-trained subword tokenization model and then use that model to tokenize some text.

Here's a breakdown of what each part of the code does:

1. `import tensorflow as tf`: This line imports the TensorFlow library. TensorFlow is a machine learning framework, but in this code, it's being used for its file I/O capabilities.

2. `serialized_model_proto = tf.io.gfile.GFile('m.model', 'rb').read()`: This line reads the contents of the file `m.model` into a byte string. The `m.model` file is assumed to contain a SentencePiece model that has been serialized (i.e., converted into a format that can be stored or transmitted) using Protocol Buffers, which is a language-neutral, platform-neutral, extensible mechanism for serializing structured data.

3. `sp = spm.SentencePieceProcessor()`: This line creates a new `SentencePieceProcessor` object, which is used to load the trained model and tokenize text.

4. `sp.load_from_serialized_proto(serialized_model_proto)`: This line loads the trained model from the serialized Protocol Buffers string.

5. `print(sp.encode_as_pieces('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡ '))`: This line tokenizes the given Amharic text into a list of subwords (or "pieces") and prints the list.

The output of the code is the list of subwords that the SentencePiece model has split the text into. This is a common step in preparing text for use in natural language processing tasks, as it allows models to handle words that they haven't seen before by breaking them down into known subwords.

## User defined and control symbols

We can define special tokens (symbols) to tweak the DNN behavior through the tokens.   Typical examples are  [BERT](https://arxiv.org/abs/1810.04805)'s special symbols., e.g., [SEP] and [CLS].

There are two types of special tokens:

- **user defined symbols**: Always treated as one token in any context. These symbols can appear in the input sentence.
- **control symbol**:  We only reserve ids for these tokens. Even if these tokens appear in the input text, they are not handled as one token. User needs to insert ids explicitly after encoding.

For experimental purposes, user defined symbols are easier to use since user can change the behavior just by modifying the input text. However,  we want to use control symbols in the production setting in order to avoid users from tweaking the behavior by feeding these special symbols in their input text.

In [29]:
# Example of user defined symbols
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls>,<start> --vocab_size=238')

sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')

# ids are reserved in both mode.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>

['▁', 'this', '▁', 'is', '▁', 'a', '▁', 'test', '<sep>', '▁', 'hello', '▁', 'world', '<cls>']
3
4
3= <sep>
4= <cls>


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls>,<start> --vocab_size=238
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m_user
  model_type: UNIGRAM
  vocab_size: 238
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  user_defined_symbols: <sep>
  user_defined_symbols: <cls>
  user_defined_symbols: <start>
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  se

#### Explanation of the code and output above:
This code is using the SentencePiece library to train a new subword tokenization model with user-defined symbols, and then use that model to tokenize some text.

Here's a breakdown of what each part of the code does:

1. `spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls>,<start> --vocab_size=238')`: This line trains a new SentencePiece model on the text in `amharic.txt`. The model is saved to a file named `m_user.model`, and its vocabulary will contain 238 unique tokens. The `--user_defined_symbols=<sep>,<cls>,<start>` option tells SentencePiece to treat `<sep>`, `<cls>`, and `<start>` as user-defined symbols. These symbols will be treated as single tokens during tokenization, even if they are not surrounded by whitespace.

2. `sp_user = spm.SentencePieceProcessor()`: This line creates a new `SentencePieceProcessor` object, which is used to load the trained model and tokenize text.

3. `sp_user.load('m_user.model')`: This line loads the trained model from `m_user.model`.

4. `print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))`: This line tokenizes the given text into a list of subwords (or "pieces") and prints the list. The `<sep>` and `<cls>` user-defined symbols are treated as separate tokens.

5. The next three `print` statements print the ID of the `<sep>` and `<cls>` tokens and the tokens that correspond to the IDs 3 and 4. Unlike control symbols, user-defined symbols are not decoded to empty strings, so `sp_user.decode_ids([3])` and `sp_user.decode_ids([4])` return `<sep>` and `<cls>`, respectively.

- This code is similar to the previous one, but it uses control symbols instead of user-defined symbols. Control symbols and user-defined symbols are two different ways to handle special tokens in SentencePiece.

In the ouput, the `<sep>` and `<cls>` control symbols are treated as part of the preceding words (`test<sep>` and `world<cls>`) because control symbols in SentencePiece are not split from their adjacent tokens unless they are separated by a space.




In [31]:
# Example of control symbols
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=238')

sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')

# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>'))  # 3
print(sp_ctrl.piece_to_id('<cls>'))  # 4
print('3=', sp_ctrl.decode_ids([3]))  # decoded to empty
print('4=', sp_ctrl.decode_ids([4]))  # decoded to empty

['▁', 'this', '▁', 'is', '▁', 'a', '▁', 'test<sep>', '▁', 'hello', '▁', 'world<cls>']
3
4
3= 
4= 


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=238
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m_ctrl
  model_type: UNIGRAM
  vocab_size: 238
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  control_symbols: <sep>
  control_symbols: <cls>
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_al

 BOS/EOS (&lt;s&gt;, &lt;/s&gt;) are defined as control symbols, but we can define them as user defined symbols.

In [None]:
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>'))   # <s>,</s> are segmented. (default behavior)

sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>'))   # <s>,</s> are handled as one token.

['▁', '<', 's', '>', '▁he', 'll', 'o', '</', 's', '>']
['▁', '<s>', '▁he', 'll', 'o', '</s>']


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: botchan.txt
  input_format: 
  model_prefix: m_bos_as_user
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  user_defined_symbols: <s>
  user_defined_symbols: </s>
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  har

## Manipulating BOS/EOS/EOS/PAD symbols

BOS, EOS, UNK, and PAD ids can be obtained with **bos_id()**, **eos_id()**,  **unk_id()**, and **pad_id()** methods. We can explicitly insert these ids as follows.

The BOS, EOS, UNK, and PAD token IDs that are being printed are not specific to the "Hello world" text, but rather are part of the SentencePiece model itself.

When you train a SentencePiece model, it creates a vocabulary of tokens, which includes the special control tokens like BOS, EOS, UNK, and PAD. These special tokens are assigned specific IDs within the model's vocabulary, and their IDs can be retrieved using the methods shown in the code:

* sp.bos_id(): Returns the ID of the Beginning of Sequence token.
* sp.eos_id(): Returns the ID of the End of Sequence token.
* sp.unk_id(): Returns the ID of the Unknown token.
* sp.pad_id(): Returns the ID of the Padding token.

In [None]:
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())  # disabled by default

# print(sp.encode_as_ids('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))
# print(sp.encode_as_pieces('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))
# # Prepend or append bos/eos ids.
# print([sp.bos_id()] + sp.encode_as_ids('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡') + [sp.eos_id()])

bos= 1
eos= 2
unk= 0
pad= -1
[7, 798, 1055, 1152, 24, 1004, 166, 1492, 210, 19, 98, 57, 1767, 3, 532, 101]
['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ', '▁ለ', 'መጀመሪያ', '▁ጊዜ', '▁የህግ', '▁ታ', 'ራ', 'ሚ', 'ዎች', '▁ድምጽ', '▁', 'ሰጡ', '፡፡']
[1, 7, 798, 1055, 1152, 24, 1004, 166, 1492, 210, 19, 98, 57, 1767, 3, 532, 101, 2]


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m --vocab_size=2000
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_pie

In [None]:
# Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')

sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')

# ids are reserved in both mode.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>

['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁world', '<cls>']
3
4
3= <sep>
4= <cls>


## Changing the vocab id and surface representation of UNK/BOS/EOS/PAD symbols

By default, UNK/BOS/EOS/PAD tokens and their ids are defined as follows:

|token|UNK|BOS|EOS|PAD|
---|---
|surface|&lt;unk&gt;|&lt;s&gt;|&lt;/s&gt;|&lt;pad&gt;|
|id|0|1|2|undefined (-1)|


We can change these mappings with **--{unk|bos|eos|pad}_id** and **--{unk|bos|eos|pad}_piece** flags.

In [None]:
spm.SentencePieceTrainer.train('--input=amharic.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')
sp = spm.SentencePieceProcessor()
sp.load('m.model')


for id in range(4):
    print(sp.id_to_piece(id), sp.is_control(id))

[PAD] True
[UNK] False
[BOS] True
[EOS] True


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  

When -1 is set,  this special symbol is disabled. UNK must not be undefined.

In [None]:
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=amharic.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))


0
0
0


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: -1
  eos_id: -1
  pad_id: -1
  

## Sampling and nbest segmentation for subword regularization

When **--model_type=unigram** (default) is used,  we can perform sampling and n-best segmentation for data augmentation. See subword regularization paper [[kudo18]](https://www.google.com/search?q=subword+regularization&rlz=1CAASUL_enJP841&oq=subword+regu&aqs=chrome.0.69i59j69i61j69i57j69i61l2j0.1571j0j7&sourceid=chrome&ie=UTF-8) for more detail.

In [None]:
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m --vocab_size=2000')

# Can obtain different segmentations per request.
# There are two hyperparameters for sampling (nbest_size and inverse temperature). see the paper [kudo18] for detail.
for n in range(10):
  print(sp.sample_encode_as_pieces('በኬንያ የምርጫ ታሪክ', -1, 0.1))

for n in range(10):
  print(sp.sample_encode_as_ids('በኬንያ የምርጫ ታሪክ', -1, 0.1))

['▁', 'በ', 'ኬ', 'ን', 'ያ', '▁', 'የ', 'ም', 'ር', 'ጫ', '▁ታሪክ']
['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ']
['▁በ', 'ኬ', 'ን', 'ያ', '▁የ', 'ምርጫ', '▁ታሪክ']
['▁በ', 'ኬ', 'ን', 'ያ', '▁የ', 'ምርጫ', '▁ታሪክ']
['▁', 'በ', 'ኬ', 'ን', 'ያ', '▁የምርጫ', '▁ታሪክ']
['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታ', 'ሪ', 'ክ']
['▁በ', 'ኬንያ', '▁የ', 'ምርጫ', '▁ታሪክ']
['▁', 'በ', 'ኬንያ', '▁', 'የ', 'ምርጫ', '▁ታሪክ']
['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ']
['▁', 'በ', 'ኬንያ', '▁', 'የ', 'ምርጫ', '▁ታሪክ']
[8, 799, 1056, 1153]
[8, 439, 5, 30, 0, 38, 1109, 1153]
[8, 799, 6, 1109, 0, 23, 78, 40]
[8, 799, 6, 9, 13, 139, 1153]
[0, 18, 799, 1056, 1153]
[8, 439, 5, 30, 1056, 211, 78, 40]
[8, 799, 1056, 0, 23, 78, 40]
[8, 439, 5, 30, 0, 38, 9, 13, 139, 1153]
[0, 18, 799, 1056, 211, 78, 40]
[8, 799, 1056, 1153]


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m --vocab_size=2000
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_pie

In [None]:
# get 10 best
print(sp.nbest_encode_as_pieces('በኬንያ የምርጫ ታሪክ', 10))
print(sp.nbest_encode_as_ids('በኬንያ የምርጫ ታሪክ', 10))

[['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ'], ['▁በ', 'ኬንያ', '▁የ', 'ምርጫ', '▁ታሪክ'], ['▁', 'በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ'], ['▁', 'በ', 'ኬንያ', '▁የ', 'ምርጫ', '▁ታሪክ'], ['▁በ', 'ኬ', 'ን', 'ያ', '▁የምርጫ', '▁ታሪክ'], ['▁በ', 'ኬንያ', '▁', 'የ', 'ምርጫ', '▁ታሪክ'], ['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታ', 'ሪ', 'ክ'], ['▁በ', 'ኬንያ', '▁የ', 'ም', 'ር', 'ጫ', '▁ታሪክ'], ['▁በ', 'ኬንያ', '▁የምርጫ', '▁', 'ታ', 'ሪ', 'ክ'], ['▁በ', 'ኬ', 'ን', 'ያ', '▁የ', 'ምርጫ', '▁ታሪክ']]
[[8, 799, 1056, 1153], [8, 799, 6, 1109, 1153], [0, 18, 799, 1056, 1153], [0, 18, 799, 6, 1109, 1153], [8, 439, 5, 30, 1056, 1153], [8, 799, 0, 38, 1109, 1153], [8, 799, 1056, 211, 78, 40], [8, 799, 6, 9, 13, 139, 1153], [8, 799, 1056, 0, 23, 78, 40], [8, 439, 5, 30, 6, 1109, 1153]]


## BPE (Byte pair encoding) model

Sentencepiece supports BPE (byte-pair-encoding) for subword segmentation with **--model_type=bpe** flag.   We do not find empirical differences in translation quality between BPE and unigram model, but unigram model can perform sampling and n-best segmentation. See subword regularization paper [[kudo18]](https://www.google.com/search?q=subword+regularization&rlz=1CAASUL_enJP841&oq=subword+regu&aqs=chrome.0.69i59j69i61j69i57j69i61l2j0.1571j0j7&sourceid=chrome&ie=UTF-8) for more detail.

In [None]:
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')

print('*** BPE ***')
print(sp_bpe.encode_as_pieces('በኬንያ የምርጫ ታሪክ'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5))  # returns an empty list.

*** BPE ***
['▁በ', 'ኬ', 'ንያ', '▁የምርጫ', '▁ታሪክ']
[]


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m_bpe
  model_type: BPE
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_p

In [None]:
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')

print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('በኬንያ የምርጫ ታሪክ'))
print(sp_unigram.nbest_encode_as_pieces('በኬንያ የምርጫ ታሪክ', 5))

*** Unigram ***
['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ']
[['▁በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ'], ['▁በ', 'ኬንያ', '▁የ', 'ምርጫ', '▁ታሪክ'], ['▁', 'በ', 'ኬንያ', '▁የምርጫ', '▁ታሪክ'], ['▁', 'በ', 'ኬንያ', '▁የ', 'ምርጫ', '▁ታሪክ'], ['▁በ', 'ኬ', 'ን', 'ያ', '▁የምርጫ', '▁ታሪክ']]


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m_unigram
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pa

## Character and word model

Sentencepiece supports character and word segmentation with **--model_type=char** and **--model_type=character** flags.

In `word` segmentation, sentencepiece just segments tokens with whitespaces, so the input text must be pre-tokenized.
We can apply different segmentation algorithms transparently without changing pre/post processors.

In [None]:
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m_char --model_type=char --vocab_size=2000')

sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')

print(sp_char.encode_as_pieces('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))
print(sp_char.encode_as_ids('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))

['▁', 'በ', 'ኬ', 'ን', 'ያ', '▁', 'የ', 'ም', 'ር', 'ጫ', '▁', 'ታ', 'ሪ', 'ክ', '▁', 'ለ', 'መ', 'ጀ', 'መ', 'ሪ', 'ያ', '▁', 'ጊ', 'ዜ', '▁', 'የ', 'ህ', 'ግ', '▁', 'ታ', 'ራ', 'ሚ', 'ዎ', 'ች', '▁', 'ድ', 'ም', 'ጽ', '▁', 'ሰ', 'ጡ', '፡', '፡']
[3, 7, 177, 4, 18, 3, 6, 15, 9, 118, 3, 36, 44, 38, 3, 16, 12, 92, 12, 44, 18, 3, 128, 110, 3, 6, 40, 28, 3, 36, 33, 37, 49, 20, 3, 34, 15, 168, 3, 29, 140, 60, 60]


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m_char --model_type=char --vocab_size=2000
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m_char
  model_type: CHAR
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  u

In [None]:
spm.SentencePieceTrainer.train('--input=amharic.txt --model_prefix=m_word --model_type=word --vocab_size=2000')

sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')

print(sp_word.encode_as_pieces('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))  # '.' will not be one token.
print(sp_word.encode_as_ids('በኬንያ የምርጫ ታሪክ ለመጀመሪያ ጊዜ የህግ ታራሚዎች ድምጽ ሰጡ፡፡'))

['▁በኬንያ', '▁የምርጫ', '▁ታሪክ', '▁ለመጀመሪያ', '▁ጊዜ', '▁የህግ', '▁ታራሚዎች', '▁ድምጽ', '▁ሰጡ፡፡']
[1559, 338, 427, 1323, 25, 630, 0, 900, 0]


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=amharic.txt --model_prefix=m_word --model_type=word --vocab_size=2000
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: amharic.txt
  input_format: 
  model_prefix: m_word
  model_type: WORD
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  u

## Text normalization

Sentencepiece provides the following general pre-defined normalization rules. We can change the normalizer with **--normaliation_rule_name=&lt;NAME&gt;** flag.

- **nmt_nfkc**: NFKC normalization with some additional normalization around spaces. (default)
- **nfkc: original**: NFKC normalization.
- **nmt_nfkc_cf**: nmt_nfkc + Unicode case folding (mostly lower casing)
- **nfkc_cf**: nfkc + Unicode case folding.
- **identity**: no normalization



In [None]:
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')

sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('ＨＥＬＬＯ　ＷＯＲＬＤ.'))  # lower casing and normalization

['▁what', '▁else', '▁is', '▁there', '▁', 'hello', '▁world', '.']


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: botchan.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_i

The normalization is performed with user-defined string-to-string mappings and leftmost longest matching.
We can also define the custom normalization rules as TSV file. The TSV files for pre-defined normalization rules can be found in the data directory ([sample](https://raw.githubusercontent.com/google/sentencepiece/master/data/nfkc.tsv)). The normalization rule is compiled into FST and embedded in the model file. We don't need to specify the normalization configuration in the segmentation phase.

Here's the example of custom normalization. The TSV file is fed with **--normalization_rule_tsv=&lt;FILE&gt;** flag.

In [None]:
def tocode(s):
    out = []
    for c in s:
        out.append(str(hex(ord(c))).replace('0x', 'U+'))
    return ' '.join(out)


# TSV format:  source Unicode code points <tab> target code points
# normalize "don't => do not,  I'm => I am"
with open('normalization_rule.tsv', 'w') as f:
  f.write(tocode("I'm") + '\t' + tocode("I am") + '\n')
  f.write(tocode("don't") + '\t' + tocode("do not") + '\n')

print(open('normalization_rule.tsv', 'r').read())

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_tsv=normalization_rule.tsv')

sp = spm.SentencePieceProcessor()
# m.model embeds the normalization rule compiled into an FST.
sp.load('m.model')
print(sp.encode_as_pieces("I'm busy"))  # normalized to `I am busy'
print(sp.encode_as_pieces("I don't know it."))  # normalized to 'I do not know it.'

U+49 U+27 U+6d	U+49 U+20 U+61 U+6d
U+64 U+6f U+6e U+27 U+74	U+64 U+6f U+20 U+6e U+6f U+74

['▁I', '▁am', '▁bu', 's', 'y']
['▁I', '▁do', '▁not', '▁know', '▁it', '.']


## Randomizing training data

Sentencepiece loads all the lines of training data into memory to train the model.  However, larger training data increases the training time and memory usage, though they are linear to the training data. When **--input_sentence_size=&lt;SIZE&gt;** is specified,  Sentencepiece randomly samples &lt;SIZE&gt; lines from the whole training data.   **--shuffle_input_sentence=false** disables the random shuffle and takes the first &lt;SIZE&gt; lines.

In [None]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

sp.encode_as_pieces('this is a test.')

sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: botchan.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 1000
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -

['▁this', '▁is', '▁a', '▁t', 'est', '.']

## Vocabulary restriction

We can encode the text only using the tokens specified with **set_vocabulary** method.  The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).

In [None]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

print(sp.encode_as_pieces('this is a test.'))

# Gets all tokens as Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

# Aggregates the frequency of each token in the training data.
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in sp.encode_as_pieces(line):
            freq.setdefault(piece, 0)
            freq[piece] += 1

# only uses the token appearing more than 1000 times in the training data.
vocabs = list(filter(lambda x: x in freq and freq[x] > 1000, vocabs))
sp.set_vocabulary(vocabs)
print(sp.encode_as_pieces('this is a test.'))

# reset the restriction
sp.reset_vocabulary()
print(sp.encode_as_pieces('this is a test.'))

## Extracting crossing-words pieces

Sentencepieces does not extract pieces crossing multiple words (here the `word` means the space delimited tokens). The piece will never contain the whitespace marker (_) in the middle.

**--split_by_whtespace=false** disables this restriction and allows to extract pieces crossing multiple words.  In CJK (Chinese/Japanese/Korean), this flag will not affect the final segmentation results so much as  words are not tokenized with whitespaces in CJK.

In [None]:
import re

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --split_by_whitespace=false')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Gets all tokens as Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

for piece in vocabs[0:500]:
    if re.match('\w+▁\w+', piece):
        print(piece)

ed▁to
s▁of
ing▁the
ed▁the
s▁and


sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=botchan.txt --model_prefix=m --vocab_size=2000 --split_by_whitespace=false
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: botchan.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 0
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1


## Training sentencepiece model from the word list with frequency

We can train the sentencepiece model from the pair of &lt;word, frequency&gt;. First, you make a TSV file where the first column is the word and the second column is the frequency. Then, feed this TSV file with **--input_format=tsv** flag. Note that when feeding TSV as training data, we implicitly assume that **--split_by_whtespace=true**.

In [None]:
freq = {}
with open('botchan.txt', 'r') as f:
  for line in f:
    line = line.rstrip()
    for piece in line.split():
      freq.setdefault(piece, 0)
      freq[piece] += 1

with open('word_freq_list.tsv', 'w') as f:
  for k, v in freq.items():
    f.write('%s\t%d\n' % (k, v))

spm.SentencePieceTrainer.train('--input=word_freq_list.tsv --input_format=tsv --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

print(sp.encode_as_pieces('this is a test.'))

['▁this', '▁is', '▁a', '▁t', 'est', '.']


## Getting byte offsets of tokens

Sentencepiece keeps track of byte offset (span) of each token, which is useful for highlighting the token on top of unnormalized text.

We first need to install protobuf module as the byte offsets and all other meta data for segementation are encoded in protocol buffer.
**encode_as_serialized_proto** method returns serialized SentencePieceText proto. You can get the deserialized object by calling ParseFromString method.

The definition of SentencePieceText proto is found [here](https://github.com/google/sentencepiece/blob/3be3f2e11e2bb923c579c6be5e7335809341587f/src/sentencepiece.proto#L23).


In [None]:
!pip install protobuf

--2019-03-27 21:42:35--  https://raw.githubusercontent.com/google/sentencepiece/master/python/sentencepiece_pb2.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7382 (7.2K) [text/plain]
Saving to: ‘sentencepiece_pb2.py.1’


2019-03-27 21:42:35 (52.3 MB/s) - ‘sentencepiece_pb2.py.1’ saved [7382/7382]



In [None]:
from sentencepiece import sentencepiece_pb2

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

sp = spm.SentencePieceProcessor()
sp.load('m.model')

# One best result
spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(sp.encode_as_serialized_proto('ｈｅｌｌｏ')) # Full width hello

# begin/end (offsets) are pointing to the original input.
print(spt)

# Nbest results
nspt = sentencepiece_pb2.NBestSentencePieceText()
nspt.ParseFromString(sp.nbest_encode_as_serialized_proto('ｈｅｌｌｏ', 5))
# print(nspt)

text: "\357\275\210\357\275\205\357\275\214\357\275\214\357\275\217"
pieces {
  piece: "\342\226\201he"
  id: 28
  surface: "\357\275\210\357\275\205"
  begin: 0
  end: 6
}
pieces {
  piece: "ll"
  id: 98
  surface: "\357\275\214\357\275\214"
  begin: 6
  end: 12
}
pieces {
  piece: "o"
  id: 38
  surface: "\357\275\217"
  begin: 12
  end: 15
}



489