<a href="https://colab.research.google.com/github/shichaog/GPT/blob/main/Sentencepiece_python_module_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SentencePiece Python module
This notebook walks through examples of the sentencepiece Python module. Since the Python module calls the C++ API through SWIG, this document is also useful for developing C++ clients.

This notebook adapts the SentencePiece example from Google's GitHub repository, using Chinese examples instead.


# Mount Google Drive to read 《遮天》.txt

In [None]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/My Drive/Colab Notebooks/
!ls zhe*

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks
zhetian.txt
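
Before training, it can help to confirm that the corpus is readable from the working directory and to see how many distinct characters it contains, since Chinese text needs a vocabulary large enough to cover most of them. The following is a minimal sketch using only the standard library; `corpus_stats` is a hypothetical helper, not part of sentencepiece.

```python
from collections import Counter
from pathlib import Path


def corpus_stats(path):
    """Return (line_count, distinct_character_count) for a UTF-8 text file."""
    chars = Counter()
    lines = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            lines += 1
            chars.update(line.rstrip("\n"))
    return lines, len(chars)


if __name__ == "__main__":
    corpus = Path("zhetian.txt")  # file name taken from the listing above
    if corpus.exists():
        n_lines, n_chars = corpus_stats(corpus)
        print(f"{n_lines} lines, {n_chars} distinct characters")
    else:
        print("zhetian.txt not found in the current directory")
```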


# Install sentencepiece



In [None]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


# Basic end-to-end example
Chinese does not use spaces to separate words. The next code segment uses the **unigram** method. When --model_type=unigram (default) is used, we can perform sampling and n-best segmentation for data augmentation. See the subword regularization paper [kudo18] for more detail.

In [None]:
import sentencepiece as spm

# train a sentencepiece model from `zhetian.txt`; this writes `m.model` and `m.vocab`
# `m.vocab` is only for reference; it is not used in segmentation.
spm.SentencePieceTrainer.train('--input=zhetian.txt --model_prefix=m --vocab_size=3439')

# make a segmenter instance and load the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => id
print(sp.encode_as_pieces('叶凡经历九龙抬棺'))
print(sp.encode_as_ids('叶凡经历九龙抬棺'))

# decode: id => text
print(sp.decode_pieces(['▁', '叶', '凡', '经', '历', '九', '龙', '抬', '棺']))
# only the first five character ids are passed here, so the output is a prefix of the sentence
print(sp.decode_ids([388, 359, 295, 606, 117]))

['▁', '叶', '凡', '经', '历', '九', '龙', '抬', '棺']
[6, 388, 359, 295, 606, 117, 101, 967, 383]
叶凡经历九龙抬棺
叶凡经历九
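
To make "n-best segmentation" and "sampling" concrete, here is a toy sketch of the idea behind a unigram model: every piece has a log-probability, a segmentation is scored by the sum over its pieces, and we either rank the candidates or draw one at random. The probabilities in `TOY_LOGP` are made up for illustration, and the brute-force enumeration is an assumption chosen for readability; it is not how the library computes its lattice. With a trained model, the calls in sentencepiece are `nbest_encode_as_pieces` and `sample_encode_as_pieces`.

```python
import math
import random

# Hypothetical piece log-probabilities; a trained model would learn these.
TOY_LOGP = {
    "叶": math.log(0.10),
    "凡": math.log(0.10),
    "叶凡": math.log(0.30),
    "经": math.log(0.10),
    "历": math.log(0.10),
    "经历": math.log(0.30),
}


def all_segmentations(text, logp):
    """Enumerate every split of `text` into known pieces, with its score."""
    if not text:
        return [([], 0.0)]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in logp:
            for rest, score in all_segmentations(text[end:], logp):
                results.append(([piece] + rest, logp[piece] + score))
    return results


def nbest(text, logp, n):
    """Return the n highest-scoring segmentations, best first."""
    ranked = sorted(all_segmentations(text, logp), key=lambda x: x[1], reverse=True)
    return [pieces for pieces, _ in ranked[:n]]


def sample(text, logp, alpha, rng):
    """Draw one segmentation with probability proportional to P(x) ** alpha."""
    candidates = all_segmentations(text, logp)
    weights = [math.exp(alpha * score) for _, score in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0][0]


print(nbest("叶凡经历", TOY_LOGP, 1))
print(sample("叶凡经历", TOY_LOGP, 0.1, random.Random(0)))
```

A small `alpha` flattens the distribution, so sampled segmentations vary more; a large `alpha` concentrates the draws on the best segmentation.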


In [8]:
# returns vocab size
print(sp.get_piece_size())


# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))

# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbols.
for piece_id in range(3):
  print(sp.id_to_piece(piece_id), sp.is_control(piece_id))

3439
0
<unk> False
<s> True
</s> True
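
The lookup behaviour shown above can be pictured with a plain dictionary: any piece that is not in the vocabulary falls back to the id of `<unk>`. This is only a sketch with a six-entry toy vocabulary; the names `TOY_VOCAB`, `piece_to_id`, `id_to_piece`, and `is_control` below are local stand-ins, not the library's implementation. In the real trainer, the reserved ids can be moved with the `--unk_id`, `--bos_id`, `--eos_id`, and `--pad_id` flags.

```python
# Toy vocabulary mirroring the default layout: <unk>=0, <s>=1, </s>=2.
TOY_VOCAB = ["<unk>", "<s>", "</s>", "▁", "叶", "凡"]
PIECE_TO_ID = {piece: i for i, piece in enumerate(TOY_VOCAB)}
UNK_ID = PIECE_TO_ID["<unk>"]
CONTROL_PIECES = {"<s>", "</s>"}


def piece_to_id(piece):
    """Map a piece to its id, falling back to the <unk> id."""
    return PIECE_TO_ID.get(piece, UNK_ID)


def id_to_piece(piece_id):
    """Map an id back to its piece."""
    return TOY_VOCAB[piece_id]


def is_control(piece_id):
    """Control symbols carry no surface text (here: <s> and </s>)."""
    return TOY_VOCAB[piece_id] in CONTROL_PIECES


print(piece_to_id("__MUST_BE_UNKNOWN__"))
for piece_id in range(3):
    print(id_to_piece(piece_id), is_control(piece_id))
```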


# BPE (Byte pair encoding) model
Sentencepiece supports BPE (byte-pair encoding) for subword segmentation with the --model_type=bpe flag. We did not find empirical differences in translation quality between the BPE and unigram models, but only the unigram model can perform sampling and n-best segmentation. See the subword regularization paper [kudo18] for more detail.

In [13]:
spm.SentencePieceTrainer.train('--input=zhetian.txt --model_prefix=m_bpe --vocab_size=3439 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')

print('*** BPE ***')
print(sp_bpe.encode_as_pieces('叶凡经历九龙抬棺'))
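# NOTE: this line calls the unigram model `sp`, not `sp_bpe`, so the ids printed
# below repeat the unigram ids; use sp_bpe.encode_as_ids(...) to get BPE ids.
print(sp.encode_as_ids('叶凡经历九龙抬棺'))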
print(sp_bpe.nbest_encode_as_pieces('叶凡经历九龙抬棺', 5))  # returns an empty list.

*** BPE ***
['▁', '叶', '凡', '经', '历', '九', '龙', '抬', '棺']
[6, 388, 359, 295, 606, 117, 101, 967, 383]
[]
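
For readers new to BPE, the core training loop can be sketched in a few lines: start from single characters, repeatedly count adjacent symbol pairs, and fuse the most frequent pair into a new symbol. This is a toy illustration on a hand-made word list, assuming whole words as the unit of counting; `learn_bpe_merges` is a hypothetical helper and does not reproduce the library's trainer.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


toy_words = ["叶凡", "叶凡", "叶凡", "九龙", "九龙", "抬棺"]
print(learn_bpe_merges(toy_words, 2))  # [('叶', '凡'), ('九', '龙')]
```

Because the merge list is a fixed sequence of rules, applying it yields one segmentation per input, which is consistent with the empty n-best result above.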


# Character and word model
Sentencepiece supports character and word segmentation with the --model_type=char and --model_type=word flags.

In word segmentation, sentencepiece simply splits tokens on whitespace, so the input text must be pre-tokenized. We can apply different segmentation algorithms transparently without changing pre/post processors.

In [10]:
spm.SentencePieceTrainer.train('--input=zhetian.txt --model_prefix=m_char --model_type=char --vocab_size=3439')

sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')

print(sp_char.encode_as_pieces('叶凡经历九龙抬棺'))
print(sp_char.encode_as_ids('叶凡经历九龙抬棺'))

['▁', '叶', '凡', '经', '历', '九', '龙', '抬', '棺']
[5, 22, 23, 151, 606, 189, 146, 1134, 520]


In [12]:
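# NOTE: this cell still passes --model_type=char, so `m_word` is a second character
# model and the output below matches the previous cell. A true word model needs
# --model_type=word and whitespace-pre-tokenized input.
spm.SentencePieceTrainer.train('--input=zhetian.txt --model_prefix=m_word --model_type=char --vocab_size=3439')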

sp_char = spm.SentencePieceProcessor()
sp_char.load('m_word.model')

print(sp_char.encode_as_pieces('叶凡经历九龙抬棺'))
print(sp_char.encode_as_ids('叶凡经历九龙抬棺'))

['▁', '叶', '凡', '经', '历', '九', '龙', '抬', '棺']
[5, 22, 23, 151, 606, 189, 146, 1134, 520]
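
The difference between the two modes, and the role of the `▁` marker in the outputs above, can be sketched without the library. The functions below are hypothetical and simplified: they assume a marker is prefixed to the text and that each space becomes `▁`, and they ignore the other normalization steps sentencepiece applies.

```python
WS = "\u2581"  # "▁", the visible whitespace marker seen in the outputs above


def escape(text):
    """Prefix a marker and replace spaces with it (simplified normalization)."""
    return WS + text.replace(" ", WS)


def char_pieces(text):
    """Character-style segmentation: one piece per escaped character."""
    return list(escape(text))


def word_pieces(text):
    """Word-style segmentation: start a new piece at every whitespace marker."""
    pieces = []
    for ch in escape(text):
        if ch == WS or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces


def detokenize(pieces):
    """Join pieces, turn markers back into spaces, drop the leading one."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


print(char_pieces("叶凡经历九龙抬棺"))
print(word_pieces("叶凡 经历 九龙抬棺"))   # pre-tokenized with spaces
print(word_pieces("叶凡经历九龙抬棺"))     # no spaces: the sentence is one "word"
```

The last line shows why word mode needs pre-tokenized input for Chinese: without spaces, the whole sentence becomes a single piece.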
