# Japanese Tokenization with OpenAI

This notebook analyzes how `cl100k_base`, the byte-pair encoding used by OpenAI's embedding models, tokenizes Japanese text. The encoding is efficient for English because it was optimized on web data that includes large volumes of written English.
The goal is a rough rule of thumb for how many tokens to expect for Japanese, starting from individual characters and short, common words.

In [1]:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
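A rule of thumb of this kind comes down to a ratio of tokens to characters. The following is a minimal sketch of how that ratio could be measured, not part of the original analysis. `tokens_per_char` and `byte_encoder` are hypothetical helpers introduced for illustration, and the encoder is passed in as an argument so that the block runs without `tiktoken`.

```python
from typing import Callable, List, Sequence


def tokens_per_char(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Return the number of tokens produced per character of `text`.

    `encode` is any function mapping a string to a sequence of token ids,
    for example `enc.encode` from the cell above (assumed, not imported here).
    """
    if not text:
        raise ValueError("text must be non-empty")
    return len(encode(text)) / len(text)


# Stand-in encoder for illustration only: one "token" per UTF-8 byte.
# This is the worst case for a byte-level BPE, not what cl100k_base produces.
def byte_encoder(s: str) -> List[int]:
    return list(s.encode("utf-8"))


print(tokens_per_char("hello", byte_encoder))      # ASCII: 1 byte per character
print(tokens_per_char("こんにちは", byte_encoder))  # kana: 3 bytes per character
```

With the notebook's encoder, the call would be `tokens_per_char(text, enc.encode)`.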


## Kana
Every kana tested is either 1 or 2 tokens. The more commonly used kana tend to be 1 token, while rarer kana and all half-width katakana are 2.

In [2]:
from collections import defaultdict

def chrs(start, end):
    """Return the characters with code points in [start, end); end is exclusive."""
    return [chr(i) for i in range(start, end)]

punctuation = chrs(0x3000, 0x303f) + chrs(0x3099, 0x309f) # Not used.
hiragana = chrs(0x3041, 0x3096)
katakana = chrs(0x30a0, 0x30ff)
half_width_katakana = chrs(0xff67, 0xff9e)

all_kana = hiragana + katakana + half_width_katakana
lengths = {}
token_distributions = defaultdict(list)
for kana in all_kana:
    length = len(enc.encode(kana))
    lengths[kana] = length
    token_distributions[length].append(kana)


print(', '.join(token_distributions[1]))
print(', '.join(token_distributions[2]))
assert token_distributions[3] == []

あ, い, う, え, お, か, が, き, く, け, こ, ご, さ, ざ, し, じ, す, せ, そ, た, だ, ち, っ, つ, て, で, と, ど, な, に, の, は, ば, ま, み, め, も, や, よ, ら, り, る, れ, ろ, わ, を, ん, ア, ィ, イ, ウ, ェ, エ, オ, カ, キ, ク, グ, コ, サ, シ, ジ, ス, ズ, セ, タ, ダ, チ, ッ, テ, デ, ト, ド, ナ, ニ, バ, パ, ビ, ピ, フ, ブ, プ, ペ, ポ, マ, ム, メ, ャ, ュ, ョ, ラ, リ, ル, レ, ロ, ン, ・, ー
ぁ, ぃ, ぅ, ぇ, ぉ, ぎ, ぐ, げ, ず, ぜ, ぞ, ぢ, づ, ぬ, ね, ぱ, ひ, び, ぴ, ふ, ぶ, ぷ, へ, べ, ぺ, ほ, ぼ, ぽ, む, ゃ, ゅ, ゆ, ょ, ゎ, ゐ, ゑ, ゔ, ゕ, ゠, ァ, ゥ, ォ, ガ, ギ, ケ, ゲ, ゴ, ザ, ゼ, ソ, ゾ, ヂ, ツ, ヅ, ヌ, ネ, ノ, ハ, ヒ, ヘ, ベ, ホ, ボ, ミ, モ, ヤ, ユ, ヨ, ヮ, ワ, ヰ, ヱ, ヲ, ヴ, ヵ, ヶ, ヷ, ヸ, ヹ, ヺ, ヽ, ヾ, ｧ, ｨ, ｩ, ｪ, ｫ, ｬ, ｭ, ｮ, ｯ, ｰ, ｱ, ｲ, ｳ, ｴ, ｵ, ｶ, ｷ, ｸ, ｹ, ｺ, ｻ, ｼ, ｽ, ｾ, ｿ, ﾀ, ﾁ, ﾂ, ﾃ, ﾄ, ﾅ, ﾆ, ﾇ, ﾈ, ﾉ, ﾊ, ﾋ, ﾌ, ﾍ, ﾎ, ﾏ, ﾐ, ﾑ, ﾒ, ﾓ, ﾔ, ﾕ, ﾖ, ﾗ, ﾘ, ﾙ, ﾚ, ﾛ, ﾜ, ﾝ
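The two lists above are easier to compare when grouped by script. The sketch below tallies token counts per script using the same code-point ranges as the cell above. `script_of` and `tally_by_script` are hypothetical helpers, and `sample` holds only a few entries copied from the output above, so that the block runs on its own.

```python
from collections import Counter
from typing import Dict, Mapping, Tuple

# Code-point ranges copied from the cell above (end is exclusive, as in chrs()).
SCRIPT_RANGES = {
    "hiragana": (0x3041, 0x3096),
    "katakana": (0x30A0, 0x30FF),
    "half_width_katakana": (0xFF67, 0xFF9E),
}


def script_of(ch: str) -> str:
    """Name the script a single character falls in, or 'other'."""
    cp = ord(ch)
    for name, (start, end) in SCRIPT_RANGES.items():
        if start <= cp < end:
            return name
    return "other"


def tally_by_script(lengths: Mapping[str, int]) -> Dict[Tuple[str, int], int]:
    """Count characters per (script, token_count) pair."""
    return dict(Counter((script_of(ch), n) for ch, n in lengths.items()))


# A few entries taken from the output above; in the notebook, pass `lengths`.
sample = {"あ": 1, "ぁ": 2, "ア": 1, "ァ": 2, "ｱ": 2, "ﾝ": 2}
for (script, n_tokens), count in sorted(tally_by_script(sample).items()):
    print(script, n_tokens, count)
```

Passing the full `lengths` dict built in the cell above would give the complete breakdown for all three scripts.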


In [3]:
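# Kana combinations, including digraphs written with a small kana.
pairs = [
    "ちょ",
    "っつ",
    "びゃ",
]
# Columns: string, token count, character count.
for pair in pairs:
    print(pair, len(enc.encode(pair)), len(pair))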

ちょ 3 2
っつ 2 2
びゃ 4 2


## Common sentence endings
Some common sentence endings are a single token, including the polite past-tense verb ending ました (mashita). Others, such as でした and ない, take 2 tokens.

In [4]:
# Common sentence endings written in full-width hiragana
pairs = (
    "ます",
    "です",
    "ました",
    "でした",
    "ない"
)
for pair in pairs:
    print(pair, len(enc.encode(pair)), len(pair))

ます 1 2
です 1 2
ました 1 3
でした 2 3
ない 2 2
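One way to read these results is to ask how many tokens a string saves when encoded whole, compared with encoding each character separately. The sketch below computes that difference. `merge_savings` is a hypothetical helper, and `observed` is a lookup table of counts copied from this notebook's outputs, standing in for `lambda s: len(enc.encode(s))`.

```python
from typing import Callable


def merge_savings(text: str, count: Callable[[str], int]) -> int:
    """Tokens saved by encoding `text` whole versus one character at a time."""
    return sum(count(ch) for ch in text) - count(text)


# Token counts copied from this notebook's outputs, used here as a stand-in
# for calling the real encoder.
observed = {
    "ち": 1, "ょ": 2, "ちょ": 3,
    "び": 2, "ゃ": 2, "びゃ": 4,
    "ま": 1, "す": 1, "ます": 1,
    "し": 1, "た": 1, "ました": 1,
    "で": 1, "でした": 2,
    "な": 1, "い": 1, "ない": 2,
}

for s in ("ちょ", "びゃ", "ます", "ました", "でした", "ない"):
    print(s, merge_savings(s, observed.__getitem__))
```

On these counts, ちょ, びゃ, and ない save nothing, ます and でした save 1 token each, and ました saves 2.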


## Particles
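The common particles tested here are each 1 token. This does not hold for every particle: へ, for example, appears in the 2-token list in the Kana section.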

In [5]:
particles = "はにでがな"
for particle in particles:
    print(particle, len(enc.encode(particle)))

は 1
に 1
で 1
が 1
な 1
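The check above covers five particles. The sketch below widens it to a few more single-kana particles by looking them up in counts already printed in the Kana section. `multi_token` is a hypothetical helper, and `particle_lengths` is copied by hand from that earlier output. In the notebook itself, the `lengths` dict built in the Kana section could be passed instead.

```python
from typing import List, Mapping


def multi_token(chars: str, lengths: Mapping[str, int]) -> List[str]:
    """Return the characters in `chars` whose recorded token count exceeds 1."""
    return [ch for ch in chars if lengths[ch] > 1]


# Counts read off the kana output earlier in this notebook.
particle_lengths = {
    "は": 1, "に": 1, "で": 1, "が": 1, "な": 1,
    "を": 1, "の": 1, "と": 1, "も": 1, "か": 1, "よ": 1,
    "へ": 2, "ね": 2,
}

print(multi_token("はにでがなをのともかよへね", particle_lengths))
```

Of the particles listed, only へ and ね need 2 tokens.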


## Ideographs and Kanji

Each kanji in the range tested is 1, 2, or 3 tokens, and the large majority are 2 or 3.

In [6]:
ideographs = [chr(i) for i in range(0x4e00, 0x9faf)]

# Use a separate dict so the kana counted earlier are not mixed in.
ideograph_token_distributions = defaultdict(list)
for kanji in ideographs:
    length = len(enc.encode(kanji))
    ideograph_token_distributions[length].append(kanji)


print(len(ideograph_token_distributions[1]))
print(len(ideograph_token_distributions[2]))
print(len(ideograph_token_distributions[3]))
assert ideograph_token_distributions[4] == []

549
12379
7983
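A likely reason no character in this notebook exceeds 3 tokens is that `cl100k_base` works on UTF-8 bytes, so a character encoded on its own cannot need more tokens than it has bytes. This explanation is an assumption about the encoding, not something measured above. The sketch below only checks the byte-length half of the argument, using the code-point ranges from this notebook. `utf8_lengths` is a hypothetical helper.

```python
from typing import Set


def utf8_lengths(start: int, end: int) -> Set[int]:
    """Set of UTF-8 byte lengths for code points in [start, end)."""
    return {len(chr(cp).encode("utf-8")) for cp in range(start, end)}


# Ranges copied from the cells in this notebook (end exclusive).
ranges = {
    "hiragana": (0x3041, 0x3096),
    "katakana": (0x30A0, 0x30FF),
    "half_width_katakana": (0xFF67, 0xFF9E),
    "ideographs": (0x4E00, 0x9FAF),
}

for name, (start, end) in ranges.items():
    print(name, utf8_lengths(start, end))
```

Every range reports a single byte length of 3, which is consistent with the 3-token ceiling seen for kanji and with kana never exceeding it.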
