<a href="https://colab.research.google.com/github/zetavg/LLM-Research/blob/main/LM_Tokenizer_Traditional_Chinese_Support_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LM Tokenizer Traditional Chinese Support Comparison

Find out how the tokenizers of several language models handle Traditional Chinese.

## Prepare

In [1]:
#@markdown Install dependencies and pre-download some tokenizers.
!pip install transformers
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained('EleutherAI/gpt-j-6b')
AutoTokenizer.from_pretrained('EleutherAI/pythia-70m')
AutoTokenizer.from_pretrained('bigscience/bloom')
AutoTokenizer.from_pretrained('huggyllama/llama-7b')
AutoTokenizer.from_pretrained('mosaicml/mpt-7b')
print('Done.')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Done.


### Helper function for inspecting tokenization results of CJK characters

Because of how UTF-8 encodes text (a typical CJK character takes three bytes), most tokenizers often fragment a single CJK character into multiple tokens. This makes it hard to understand how a sentence was tokenized just by looking at the raw output. For instance:

```python
>>> tokenizer.tokenize('好')
['å¥', '½']  # A single character divided into two tokens
```

```python
>>> tokenizer.tokenize('你好世界！')
['ä½', 'ł', 'å¥', '½', 'ä¸', 'ĸ', 'çķ', 'Į', 'ï', '¼', 'ģ']
# ⬆ ???
```
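The odd-looking token strings follow from two facts: a typical CJK character takes three bytes in UTF-8, and GPT-2-style byte-level BPE tokenizers display each byte as a printable stand-in character. The sketch below re-creates that byte-to-character table as it is commonly described for GPT-2. The names `bytes_to_unicode` and `show_bytes` are written here for illustration and are not imported from `transformers`, so treat this as an approximation of what the tokenizer does internally.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes are shifted to code points starting at 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def show_bytes(text):
    """Render the UTF-8 bytes of `text` with the stand-in characters."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


for ch in "你好":
    print(ch, len(ch.encode("utf-8")), "bytes ->", show_bytes(ch))
```

Under this mapping, `好` renders as `å¥½`, which is the concatenation of the two tokens `'å¥'` and `'½'` shown above: the tokenizer merged the first two bytes into one token and left the third byte on its own.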

To address this, we write a function that tries to consolidate these fragmented tokens into complete CJK characters. It also annotates each character with the number of tokens used to form it, which lets us gauge how well those CJK characters are supported.

In [2]:
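def consolidate_tokens_it(tokens, tokenizer):
    """Yield (text, token_count) pairs.

    Consecutive tokens are grouped greedily while they decode to a single
    character; a lone token that decodes to several characters (for example
    an English word) is yielded as is.
    """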
    processed_tokens = 0
    i = 1
    while i <= len(tokens):
        possible_tokens_to_form_full_character = tokens[processed_tokens:i]
        possible_full_character = tokenizer.convert_tokens_to_string(
            possible_tokens_to_form_full_character)
        if len(possible_full_character) > 1:
            if len(possible_tokens_to_form_full_character) > 1:
                # We got a token that should possibly belong to the next character.
                # Yield with the last token removed.
                tokens_to_form_full_character = \
                    possible_tokens_to_form_full_character[:-1]
                full_character = tokenizer.convert_tokens_to_string(
                    tokens_to_form_full_character
                )
                yield (full_character, len(tokens_to_form_full_character))
            else:
                # We only have one token, so this might be an English word.
                # Yield it and start with the next token on the next iteration.
                yield (
                    possible_full_character,
                    len(possible_tokens_to_form_full_character))
                i += 1
            # Set processed_tokens to the first token of the next character.
            processed_tokens = i - 1
        else:
            # Try to add another token to the character on the next iteration.
            i += 1
    # If we have anything left, yield it.
    remaining_tokens = tokens[processed_tokens:i]
    if remaining_tokens:
        yield (tokenizer.convert_tokens_to_string(remaining_tokens), len(remaining_tokens))


def consolidate_tokens(tokens, tokenizer):
    return list(consolidate_tokens_it(tokens, tokenizer))


# Sample usage
def print_tokenize_result(text, tokenizer):
    tokenize_result = tokenizer.tokenize(text)
    print("tokenize_result:\n", tokenize_result)
    consolidated_tokenize_result = consolidate_tokens(
        tokenize_result, tokenizer)
    print("consolidated_tokenize_result:\n", consolidated_tokenize_result)


tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-j-6b')
print_tokenize_result('你好世界！', tokenizer)


tokenize_result:
 ['ä½', 'ł', 'å¥', '½', 'ä¸', 'ĸ', 'çķ', 'Į', 'ï', '¼', 'ģ']
consolidated_tokenize_result:
 [('你', 2), ('好', 2), ('世', 2), ('界', 2), ('！', 3)]
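One way to turn the consolidated result into a single number is to divide the total token count by the number of characters covered. The helper below is a minimal sketch, and `tokens_per_character` is a name introduced here rather than part of the notebook. It counts every character in a group, so English words in a mixed sentence would pull the average down.

```python
def tokens_per_character(consolidated):
    """Average number of tokens spent per character in a consolidated result."""
    total_tokens = sum(count for _, count in consolidated)
    total_chars = sum(len(text) for text, _ in consolidated)
    if total_chars == 0:
        return 0.0
    return total_tokens / total_chars


# Consolidated result printed above for '你好世界！' with the GPT-J tokenizer.
sample = [('你', 2), ('好', 2), ('世', 2), ('界', 2), ('！', 3)]
print(round(tokens_per_character(sample), 2))
```

For the sample above this is 11 tokens over 5 characters, or 2.2 tokens per character. A value close to 1 (or below, when one token covers several characters) suggests better support.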


In [3]:
#@title Define some sample text for testing
sample_sentences = [
    '網際網路（英語：Internet）是指 20 世紀末期興起電腦網路與電腦網路之間所串連成的龐大網路系統。',
    '人工智慧（英語：artificial intelligence，縮寫為 AI），是指由人製造出來的機器所表現出來的智慧。',
    '程式設計師們越來越依賴 Git 進行版本控制、使用 Python、Ruby 或 JavaScript 等程式語言開發 Web 應用程式。',
]

In [4]:
def print_info_and_sample_results(tokenizer):
    print("Tokenizer Class:", tokenizer.__class__.__name__)
    print("Vocab Size:", tokenizer.vocab_size)
    print()
    for sample in sample_sentences:
        print("sample:", sample)
        print_tokenize_result(sample, tokenizer)
        print()

## Tokenizers

In [5]:
#@title EleutherAI/gpt-j-6b
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-j-6b')
print_info_and_sample_results(tokenizer)

Tokenizer Class: GPT2TokenizerFast
Vocab Size: 50257

sample: 網際網路（英語：Internet）是指 20 世紀末期興起電腦網路與電腦網路之間所串連成的龐大網路系統。
tokenize_result:
 ['ç', '¶', '²', 'éļ', 'Ľ', 'ç', '¶', '²', 'è', '·', '¯', 'ï', '¼', 'Ī', 'è', 'ĭ', '±', 'èª', 'ŀ', 'ï', '¼', 'ļ', 'Internet', 'ï', '¼', 'ī', 'æĺ¯', 'æ', 'Į', 'ĩ', 'Ġ20', 'Ġ', 'ä¸', 'ĸ', 'ç', '´', 'Ģ', 'æľ', '«', 'æľ', 'Ł', 'èĪ', 'Īè', 'µ', '·', 'éĽ', '»', 'è', 'ħ', '¦', 'ç', '¶', '²', 'è', '·', '¯', 'èĪ', 'ĩ', 'éĽ', '»', 'è', 'ħ', '¦', 'ç', '¶', '²', 'è', '·', '¯', 'ä¹ĭ', 'éĸ', 'ĵ', 'æī', 'Ģ', 'ä¸', '²', 'éĢ', '£', 'æĪ', 'Ĳ', 'çļĦ', 'é¾', 'Ĳ', 'å¤§', 'ç', '¶', '²', 'è', '·', '¯', 'ç', '³', '»', 'ç', 'µ', '±', 'ãĢĤ']
consolidated_tokenize_result:
 [('網', 3), ('際', 2), ('網', 3), ('路', 3), ('（', 3), ('英', 3), ('語', 2), ('：', 3), ('Internet', 1), ('）', 3), ('是', 1), ('指', 3), (' 20', 1), (' ', 1), ('世', 2), ('紀', 3), ('末', 2), ('期', 2), ('�', 1), ('��', 1), ('�', 1), ('�', 1), ('電', 2), ('腦', 3), ('網', 3), ('路', 3), ('與', 2), ('電', 2), ('腦', 3), ('網', 3), ('路'

Most Chinese characters are split into two or more tokens.
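The `('�', 1)` entries in the output above appear where a token straddles two characters (here `'Īè'` carries the last byte of 興 and the first byte of 起), so no prefix of the group decodes to exactly one character. A possible variant, sketched below, keeps adding tokens until the group decodes without the replacement character U+FFFD. It relies only on `convert_tokens_to_string`, the same method the helper above uses. `StubByteTokenizer` and `consolidate_until_decodable` are hypothetical names, and the stub stands in for a real tokenizer so the example runs without downloads.

```python
class StubByteTokenizer:
    """Hypothetical stand-in for a real tokenizer: every token is a bytes object."""

    def convert_tokens_to_string(self, tokens):
        return b"".join(tokens).decode("utf-8", errors="replace")


def consolidate_until_decodable(tokens, tokenizer):
    """Group tokens until the group decodes without U+FFFD, then emit it."""
    results = []
    start = 0
    for end in range(1, len(tokens) + 1):
        text = tokenizer.convert_tokens_to_string(tokens[start:end])
        if "\ufffd" not in text:
            results.append((text, end - start))
            start = end
    if start < len(tokens):
        # Trailing tokens that never formed valid text are reported as they are.
        leftover = tokens[start:]
        results.append(
            (tokenizer.convert_tokens_to_string(leftover), len(leftover)))
    return results


# 興 is E8 88 88 and 起 is E8 B5 B7; the second token straddles both characters.
tokens = [b"\xe8\x88", b"\x88\xe8", b"\xb5", b"\xb7"]
print(consolidate_until_decodable(tokens, StubByteTokenizer()))
```

The trade-off is that straddling tokens produce multi-character groups such as `('興起', 4)` instead of per-character counts, and text that legitimately contains U+FFFD would never close a group.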

In [6]:
#@title Pythia
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/pythia-70m')
print_info_and_sample_results(tokenizer)

Tokenizer Class: GPTNeoXTokenizerFast
Vocab Size: 50254

sample: 網際網路（英語：Internet）是指 20 世紀末期興起電腦網路與電腦網路之間所串連成的龐大網路系統。
tokenize_result:
 ['ç¶', '²', 'éļ', 'Ľ', 'ç¶', '²', 'è·¯', 'ï¼Ī', 'èĭ', '±', 'èª', 'ŀ', 'ï¼ļ', 'Internet', 'ï¼ī', 'æĺ¯', 'æĮĩ', 'Ġ20', 'Ġ', 'ä¸ĸ', 'ç´', 'Ģ', 'æľ', '«', 'æľŁ', 'èĪ', 'Ī', 'èµ·', 'éĽ', '»', 'è', 'ħ', '¦', 'ç¶', '²', 'è·¯', 'èĪ', 'ĩ', 'éĽ', '»', 'è', 'ħ', '¦', 'ç¶', '²', 'è·¯', 'ä¹ĭ', 'éĸĵ', 'æīĢ', 'ä¸', '²', 'éĢ', '£', 'æĪĲ', 'çļĦ', 'é', '¾', 'Ĳ', 'å¤§', 'ç¶', '²', 'è·¯', 'ç³»', 'çµ', '±', 'ãĢĤ']
consolidated_tokenize_result:
 [('網', 2), ('際', 2), ('網', 2), ('路', 1), ('（', 1), ('英', 2), ('語', 2), ('：', 1), ('Internet', 1), ('）', 1), ('是', 1), ('指', 1), (' 20', 1), (' ', 1), ('世', 1), ('紀', 2), ('末', 2), ('期', 1), ('興', 2), ('起', 1), ('電', 2), ('腦', 3), ('網', 2), ('路', 1), ('與', 2), ('電', 2), ('腦', 3), ('網', 2), ('路', 1), ('之', 1), ('間', 1), ('所', 1), ('串', 2), ('連', 2), ('成', 1), ('的', 1), ('龐', 3), ('大', 1), ('網', 2), ('路', 1), ('系', 1), ('統', 2), ('。', 

In [7]:
#@title BLOOM
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom')
print_info_and_sample_results(tokenizer)

Tokenizer Class: BloomTokenizerFast
Vocab Size: 250680

sample: 網際網路（英語：Internet）是指 20 世紀末期興起電腦網路與電腦網路之間所串連成的龐大網路系統。
tokenize_result:
 ['ç¶²éļĽç¶²è·¯', 'ï¼Īèĭ±èªŀï¼ļ', 'Internet', 'ï¼īæĺ¯', 'æĮĩ', 'Ġ20', 'Ġ', 'ä¸ĸç´Ģ', 'æľ«æľŁ', 'èĪĪèµ·', 'éĽ»èħ¦', 'ç¶²è·¯', 'èĪĩ', 'éĽ»èħ¦', 'ç¶²è·¯', 'ä¹ĭéĸĵ', 'æīĢ', 'ä¸²', 'éĢ£', 'æĪĲçļĦ', 'é¾Ĳ', 'å¤§', 'ç¶²è·¯', 'ç³»çµ±', 'ãĢĤ']
consolidated_tokenize_result:
 [('網際網路', 1), ('（英語：', 1), ('Internet', 1), ('）是', 1), ('指', 1), (' 20', 1), (' ', 1), ('世紀', 1), ('末期', 1), ('興起', 1), ('電腦', 1), ('網路', 1), ('與', 1), ('電腦', 1), ('網路', 1), ('之間', 1), ('所', 1), ('串', 1), ('連', 1), ('成的', 1), ('龐', 1), ('大', 1), ('網路', 1), ('系統', 1), ('。', 1)]

sample: 人工智慧（英語：artificial intelligence，縮寫為 AI），是指由人製造出來的機器所表現出來的智慧。
tokenize_result:
 ['äººå·¥', 'æĻºæħ§', 'ï¼Īèĭ±èªŀï¼ļ', 'art', 'ificial', 'Ġintelligence', 'ï¼Į', 'ç¸®å¯«', 'çĤº', 'ĠAI', 'ï¼ī', 'ï¼Į', 'æĺ¯æĮĩ', 'çĶ±', 'äºº', 'è£½éĢł', 'åĩºä¾ĨçļĦ', 'æ©ŁåĻ¨', 'æīĢ', 'è¡¨çı¾', 'åĩºä¾ĨçļĦ', 'æĻºæħ§', 'ãĢĤ']
consolidated_t

BLOOM uses a vocab size of about 250,000, roughly five times that of the other models.
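The "five times" figure can be checked against the vocab sizes printed by the cells in this notebook. The snippet below is plain arithmetic on those printed values; the dictionary name `vocab_sizes` is introduced here for illustration.

```python
# Vocab sizes as printed by the tokenizer cells in this notebook.
vocab_sizes = {
    "EleutherAI/gpt-j-6b": 50257,
    "EleutherAI/pythia-70m": 50254,
    "bigscience/bloom": 250680,
    "huggyllama/llama-7b": 32000,
    "mosaicml/mpt-7b": 50254,
}

bloom_size = vocab_sizes["bigscience/bloom"]
ratios = {
    name: round(bloom_size / size, 2)
    for name, size in vocab_sizes.items()
    if name != "bigscience/bloom"
}
for name, ratio in ratios.items():
    print(f"BLOOM vocab is {ratio}x the size of {name}")
```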

In [8]:
#@title LLaMA
tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')
print_info_and_sample_results(tokenizer)

Tokenizer Class: LlamaTokenizerFast
Vocab Size: 32000

sample: 網際網路（英語：Internet）是指 20 世紀末期興起電腦網路與電腦網路之間所串連成的龐大網路系統。
tokenize_result:
 ['▁', '<0xE7>', '<0xB6>', '<0xB2>', '<0xE9>', '<0x9A>', '<0x9B>', '<0xE7>', '<0xB6>', '<0xB2>', '路', '（', '英', '語', '：', 'Internet', '）', '是', '指', '▁', '2', '0', '▁', '世', '紀', '<0xE6>', '<0x9C>', '<0xAB>', '期', '<0xE8>', '<0x88>', '<0x88>', '起', '電', '<0xE8>', '<0x85>', '<0xA6>', '<0xE7>', '<0xB6>', '<0xB2>', '路', '<0xE8>', '<0x88>', '<0x87>', '電', '<0xE8>', '<0x85>', '<0xA6>', '<0xE7>', '<0xB6>', '<0xB2>', '路', '之', '間', '所', '串', '連', '成', '的', '<0xE9>', '<0xBE>', '<0x90>', '大', '<0xE7>', '<0xB6>', '<0xB2>', '路', '系', '<0xE7>', '<0xB5>', '<0xB1>', '。']
consolidated_tokenize_result:
 [('�', 2), ('�', 1), ('�', 1), ('�', 1), ('�', 1), ('�', 1), ('�', 1), ('�', 1), ('�', 1), ('路', 1), ('（', 1), ('英', 1), ('語', 1), ('：', 1), ('Internet', 1), ('）', 1), ('是', 1), ('指', 1), ('2', 2), ('0', 1), ('世', 2), ('紀', 1), ('�', 1), ('�', 1), ('�', 1), ('期', 1), ('�'
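Tokens such as `<0xE7>` in the LLaMA output look like byte-fallback tokens: as far as I understand SentencePiece, a character missing from the vocabulary is emitted as its raw UTF-8 bytes, one token per byte. Decoding each of those tokens on its own gives `�`, which would explain the consolidated result above. The sketch below reassembles such tokens by hand. `decode_byte_fallback` is a hypothetical helper, not a `transformers` API, and it does not handle the `▁` space marker.

```python
import re

# Matches byte-fallback tokens of the form <0xHH>.
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def decode_byte_fallback(tokens):
    """Join tokens into text, treating <0xHH> tokens as raw UTF-8 bytes."""
    buffer = bytearray()
    for token in tokens:
        match = BYTE_TOKEN.match(token)
        if match:
            buffer.append(int(match.group(1), 16))
        else:
            buffer.extend(token.encode("utf-8"))
    return buffer.decode("utf-8", errors="replace")


# First ten tokens after the leading '▁' in the LLaMA output above.
tokens = ['<0xE7>', '<0xB6>', '<0xB2>', '<0xE9>', '<0x9A>', '<0x9B>',
          '<0xE7>', '<0xB6>', '<0xB2>', '路']
print(decode_byte_fallback(tokens))
```

Read this way, 網 and 際 each cost three tokens with this tokenizer, while 路 is in the vocabulary and costs one.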

In [9]:
#@title MPT-7b
tokenizer = AutoTokenizer.from_pretrained('mosaicml/mpt-7b')
print_info_and_sample_results(tokenizer)

Tokenizer Class: GPTNeoXTokenizerFast
Vocab Size: 50254

sample: 網際網路（英語：Internet）是指 20 世紀末期興起電腦網路與電腦網路之間所串連成的龐大網路系統。
tokenize_result:
 ['ç¶', '²', 'éļ', 'Ľ', 'ç¶', '²', 'è·¯', 'ï¼Ī', 'èĭ', '±', 'èª', 'ŀ', 'ï¼ļ', 'Internet', 'ï¼ī', 'æĺ¯', 'æĮĩ', 'Ġ20', 'Ġ', 'ä¸ĸ', 'ç´', 'Ģ', 'æľ', '«', 'æľŁ', 'èĪ', 'Ī', 'èµ·', 'éĽ', '»', 'è', 'ħ', '¦', 'ç¶', '²', 'è·¯', 'èĪ', 'ĩ', 'éĽ', '»', 'è', 'ħ', '¦', 'ç¶', '²', 'è·¯', 'ä¹ĭ', 'éĸĵ', 'æīĢ', 'ä¸', '²', 'éĢ', '£', 'æĪĲ', 'çļĦ', 'é', '¾', 'Ĳ', 'å¤§', 'ç¶', '²', 'è·¯', 'ç³»', 'çµ', '±', 'ãĢĤ']
consolidated_tokenize_result:
 [('網', 2), ('際', 2), ('網', 2), ('路', 1), ('（', 1), ('英', 2), ('語', 2), ('：', 1), ('Internet', 1), ('）', 1), ('是', 1), ('指', 1), (' 20', 1), (' ', 1), ('世', 1), ('紀', 2), ('末', 2), ('期', 1), ('興', 2), ('起', 1), ('電', 2), ('腦', 3), ('網', 2), ('路', 1), ('與', 2), ('電', 2), ('腦', 3), ('網', 2), ('路', 1), ('之', 1), ('間', 1), ('所', 1), ('串', 2), ('連', 2), ('成', 1), ('的', 1), ('龐', 3), ('大', 1), ('網', 2), ('路', 1), ('系', 1), ('統', 2), ('。', 