<a href="https://colab.research.google.com/github/zetavg/LLM-Research/blob/main/Transformers_Tokenizer_For_Training_Study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers Tokenizer (For Training) Study

Analyze the use of the tokenizer in training some open-source LLMs, and experiment with training a new tokenizer that performs better on CJK characters (Traditional Chinese).

In [1]:
!pip install transformers==4.28.0 sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# decapoda-research/llama-7b-hf seems to have strange tokenizer config
llama_model_name = 'huggyllama/llama-7b'

## Basic: How Tokenizers Work

In [None]:
# @markdown Load the GPT-2 Tokenizer

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True)

In [None]:
# @markdown This tokenizer has a few special symbols, like Ġ and Ċ, which denote spaces and newlines, respectively.

tokenizer.tokenize("Hello, world!\nNice to meet you!")

['Hello', ',', 'Ġworld', '!', 'Ċ', 'Nice', 'Ġto', 'Ġmeet', 'Ġyou', '!']

In [None]:
# @markdown Load the LLaMA Tokenizer

from transformers import LlamaTokenizer
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_model_name)
llama_tokenizer

LlamaTokenizer(name_or_path='huggyllama/llama-7b', vocab_size=32000, model_max_length=2048, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=False)

In [None]:
# @markdown The LLaMA tokenizer handles special symbols differently.

llama_tokenizer.tokenize("Hello, world!\nNice to meet you!")

['▁Hello',
 ',',
 '▁world',
 '!',
 '<0x0A>',
 'N',
 'ice',
 '▁to',
 '▁meet',
 '▁you',
 '!']

In [None]:
# @markdown The main method to tokenize. ([Docs](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__))
# @markdown
# @markdown It returns a [BatchEncoding](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__.returns) which has the following fields:
# @markdown
# @markdown * `input_ids` — List of token ids to be fed to a model. They are token indices, numerical representations of tokens.
# @markdown * `token_type_ids` — List of token type ids to be fed to a model (when `return_token_type_ids=True` or if “token_type_ids” is in self.model_input_names). These are used by models that delas with classification on pairs of sentences or question answering ([See](https://huggingface.co/docs/transformers/glossary#token-type-ids)).
# @markdown * `attention_mask` — List of indices specifying which tokens should be attended to by the model (when `return_attention_mask=True` or if “attention_mask” is in self.model_input_names) ([See](https://huggingface.co/docs/transformers/glossary#attention-mask)).
# @markdown * `overflowing_tokens` — List of overflowing tokens sequences (when a max_length is specified and `return_overflowing_tokens=True`).
# @markdown * `num_truncated_tokens` — num_truncated_tokens — Number of tokens truncated (when a max_length is specified and `return_overflowing_tokens=True`).
# @markdown * `special_tokens_mask` — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when `add_special_tokens=True` and `return_special_tokens_mask=True`).
# @markdown * `length` — The length of the inputs (when return_length=True)

print("Let's see the default behaiver:")
print("GPT-2 tokenizer: ", tokenizer("Hello, world!"))
print("LLaMA tokenizer: ", llama_tokenizer("Hello, world!"))
print("")

print("Now, setting 'max_length=4', 'truncation=True, 'return_overflowing_tokens=True' and 'return_length=True':")
print(
    "GPT-2 tokenizer: ",
    tokenizer(
        "Hello, world!",
        max_length=4, truncation=True, return_overflowing_tokens=True))
print(
    "LLaMA tokenizer: ",
    llama_tokenizer(
        "Hello, world!",
        max_length=4, truncation=True, return_overflowing_tokens=True))


Let's see the default behaiver:
GPT-2 tokenizer:  {'input_ids': [15496, 11, 995, 0], 'attention_mask': [1, 1, 1, 1]}
LLaMA tokenizer:  {'input_ids': [1, 15043, 29892, 3186, 29991], 'attention_mask': [1, 1, 1, 1, 1]}

Now, setting 'max_length=4', 'truncation=True, 'return_overflowing_tokens=True' and 'return_length=True':
GPT-2 tokenizer:  {'input_ids': [[15496, 11, 995, 0]], 'attention_mask': [[1, 1, 1, 1]], 'overflow_to_sample_mapping': [0]}
LLaMA tokenizer:  {'overflowing_tokens': [29991], 'num_truncated_tokens': 1, 'input_ids': [1, 15043, 29892, 3186], 'attention_mask': [1, 1, 1, 1]}


## Tokanizers in Training

Here, we analyze how the tokenizers participated in the training of some open-source language models.

### Alpaca-LoRA

From https://github.com/tloen/alpaca-lora/blob/65fb8225c09af81feb5edb1abb12560f02930703/finetune.py.

#### Sample data and default values

For simulating the dataset and hyperparams on training.

In [None]:
cutoff_len = 512

sample_aplaca_lora_user_prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the capital of Spain.

### Response:
"""

sample_aplaca_lora_full_prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the capital of Spain.

### Response:
The capital of Spain is Madrid.
"""

#### Setup the tokenizer

Code below are copied from https://github.com/tloen/alpaca-lora/blob/65fb8225/finetune.py#L126.


In [None]:
alpaca_lora_tokenizer = LlamaTokenizer.from_pretrained(llama_model_name)

In [None]:
# @markdown Not sure why this is needed. The default seems to be `None`.

alpaca_lora_tokenizer.pad_token_id = (
    0  # unk. we want this to be different from the eos token
)

In [None]:
# @markdown Not sure why this is needed. The default seems to be `'right'`.

alpaca_lora_tokenizer.padding_side = "left"  # Allow batched inference

In [None]:
# @markdown Not sure why this is all 0, it seems to be 1, 2, 0 regarding to the [generation script](https://github.com/tloen/alpaca-lora/blob/630d114/generate.py#L76-L78). <br />
# @markdown Update: using `huggyllama/llama-7b` instead of `decapoda-research/llama-7b-hf` fixes this.

print(alpaca_lora_tokenizer.bos_token_id)
print(alpaca_lora_tokenizer.eos_token_id)
print(alpaca_lora_tokenizer.unk_token_id)

1
2
0


#### How the tokenizer works in alpaca-lora fine-tuning source code

In [None]:
# @markdown From https://github.com/tloen/alpaca-lora/blob/65fb8225/finetune.py#L126.<br />
# @markdown * `truncation=True` and `max_length=cutoff_len` is being set to ensure the tokenized prompt is no longer than the cutoff length.<br />
# @markdown * I think `padding=False` is set because the padding will be done by the [`transformers.DataCollatorForSeq2Seq` Data Collactor](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq), which is passed to the trainer later and will dynamically pad the inputs received, as well as the labels.
# @markdown * `return_tensors=None` is set also because `transformers.DataCollatorForSeq2Seq` is in charge of handling this later, alough `None` seems to be also the default value making this redundant. (Set it to 'pt' will let the tokenizer return PyTorch torch.Tensor objects.)
# @markdown * As the comment, "there's probably a way to do this with the tokenizer settings", it seems that using the [`TemplateProcessing` post-processor](https://huggingface.co/docs/tokenizers/v0.13.3/en/api/post-processors#tokenizers.processors.TemplateProcessing) can achieve this (adding the eos token at the end of input), but I can't find a way to add [post-processors](https://huggingface.co/docs/tokenizers/v0.13.3/en/pipeline) to LLaMA tokenizer or GPT tokenizer. Maybe it's not the way of using TemplateProcessing. So I guess it make sense to do this manually. 

def tokenize(prompt, add_eos_token=True):
    # there's probably a way to do this with the tokenizer settings
    # but again, gotta move fast
    result = alpaca_lora_tokenizer(
        prompt,
        truncation=True,
        max_length=cutoff_len,
        padding=False,
        return_tensors=None,
    )
    if (
        result["input_ids"][-1] != alpaca_lora_tokenizer.eos_token_id
        and len(result["input_ids"]) < cutoff_len
        and add_eos_token
    ):
        result["input_ids"].append(alpaca_lora_tokenizer.eos_token_id)
        result["attention_mask"].append(1)

    result["labels"] = result["input_ids"].copy()

    return result

# Test it
tokenize(sample_aplaca_lora_full_prompt)

# Test if long input will be truncated
# tokenize(" a" * (cutoff_len + 10))


{'input_ids': [1, 29871, 13, 21140, 340, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 12542, 278, 7483, 310, 13616, 29889, 13, 13, 2277, 29937, 13291, 29901, 13, 1576, 7483, 310, 13616, 338, 9669, 29889, 13, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [1, 29871, 13, 21140, 340, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 12542, 278, 7483, 310, 13616, 29889, 13, 13, 2277, 29937, 13291, 29901, 13, 1576, 7483, 310, 13616, 338, 9669, 29889, 13, 2]}

In [None]:
# @markdown From https://github.com/tloen/alpaca-lora/blob/65fb8225/finetune.py#L148. The arguments are modified so that sample data can be passed into it directly without using the prompter to generate a prompt.<br /><br />
# @markdown The main purpose of this function (other then calling `prompter.generate_prompt`) is to deal with `train_on_inputs`, if `train_on_inputs` is set to False, then the ground truth that is used to calcualte the loss, which is `labels`, will be modified so that the "user input" tokens will be relpaced by `-100`.
# @markdown > <p>In most cases, the `IGNORE_TOKEN_ID` is set to -100. This value is chosen because the PyTorch implementation of common loss functions, such as CrossEntropyLoss, is designed to ignore targets with the value -100 during loss calculation. When the loss function encounters a target with this value, it doesn't contribute to the loss, effectively making the model ignore the corresponding input tokens during training.
# @markdown > </p> -- GPT-4

def generate_and_tokenize_prompt(user_prompt, full_prompt):
    # full_prompt = prompter.generate_prompt(
    #     data_point["instruction"],
    #     data_point["input"],
    #     data_point["output"],
    # )
    tokenized_full_prompt = tokenize(full_prompt)
    if not train_on_inputs:
        # user_prompt = prompter.generate_prompt(
        #     data_point["instruction"], data_point["input"]
        # )
        tokenized_user_prompt = tokenize(
            user_prompt, add_eos_token=add_eos_token
        )
        user_prompt_len = len(tokenized_user_prompt["input_ids"])

        if add_eos_token:
            user_prompt_len -= 1

        tokenized_full_prompt["labels"] = [
            -100
        ] * user_prompt_len + tokenized_full_prompt["labels"][
            user_prompt_len:
        ]  # could be sped up, probably
    return tokenized_full_prompt

# Test it. Here the begining of 'labels' should be "masked" with a lot of "-100".
add_eos_token = True
train_on_inputs = False
generate_and_tokenize_prompt(sample_aplaca_lora_user_prompt, sample_aplaca_lora_full_prompt)

{'input_ids': [1, 29871, 13, 21140, 340, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 12542, 278, 7483, 310, 13616, 29889, 13, 13, 2277, 29937, 13291, 29901, 13, 1576, 7483, 310, 13616, 338, 9669, 29889, 13, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1576, 7483, 310, 13616, 338, 9669, 29889, 13, 2]}

In [None]:
# @markdown In the [source code of alpaca-lora](https://github.com/tloen/alpaca-lora/blob/65fb8225/finetune.py#L257-L259), the trainer is being passed a `transformers.DataCollatorForSeq2Seq` data collector as in this code cell.
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    alpaca_lora_tokenizer,
    pad_to_multiple_of=8,
    return_tensors="pt",
    padding=True
)

In [None]:
# @markdown Not sure why but we will get an error
# @markdown > ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
# @markdown
# @markdown while using the data collector directly ff we don't add the `pad_token` with `alpaca_lora_tokenizer.add_special_tokens`. This part is not seen in the source code of alpaca-lora. I guess there're some magic happening as the trainer uses the data collector, or the version of the transformers lib.
alpaca_lora_tokenizer.add_special_tokens({'pad_token': '<unk>'})

# Check
print(alpaca_lora_tokenizer.special_tokens_map)
print(alpaca_lora_tokenizer.pad_token_id)

{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}
0


In [None]:
# @markdown Now, check if the inputs in the same batch (?) will be padded to have the same length.

batch_encoding_1 = generate_and_tokenize_prompt(
    sample_aplaca_lora_user_prompt,
    sample_aplaca_lora_full_prompt
)
batch_encoding_2 = generate_and_tokenize_prompt(
    'Hi',
    'Hi there!')
features = data_collator([batch_encoding_1, batch_encoding_2])

# @markdown In the output, we can see that the shorter input has been padded with lots of `0`s. The pad token is inserted from the left because we have `alpaca_lora_tokenizer.padding_side = "left"`. Also, the attention masks has also been set correctly to mask out the pad tokens.
print(features)

{'input_ids': tensor([[    0,     0,     1, 29871,    13, 21140,   340,   338,   385, 15278,
           393, 16612,   263,  3414, 29889, 14350,   263,  2933,   393,  7128,
          2486,  1614,  2167,   278,  2009, 29889,    13,    13,  2277, 29937,
          2799,  4080, 29901,    13, 12542,   278,  7483,   310, 13616, 29889,
            13,    13,  2277, 29937, 13291, 29901,    13,  1576,  7483,   310,
         13616,   338,  9669, 29889,    13,     2],
        [    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     1,  6324,   727, 29991,     2]]), 'attention_mask': tensor([[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [None]:
# @markdown Now, we try to decode it back.
print("features['input_ids'][0]: ")
print("---")
print(alpaca_lora_tokenizer.decode(features['input_ids'][0]))
print("---")
print("")

# @markdown `features['labels'][0]` can't be decoded directly because those `-100`s will make the tokenizer return an `IndexError: piece id is out of range.`
# @markdown Therefore, we map them into `0`s (`<unk>`) before decoding.
print("features['labels'][0]: ")
print("---")
print(alpaca_lora_tokenizer.decode([
    0 if id < 0 else id for id in features['labels'][0]
]))
print("---")

features['input_ids'][0]: 
---
<unk><unk><s> 
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the capital of Spain.

### Response:
The capital of Spain is Madrid.
</s>
---

features['labels'][0]: 
---
<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk> The capital of Spain is Madrid.
</s>
---


## Experiment On Training Tokenizers


### Tool for inspecting tokenize results for CJK characters

Since CJK characters often got splitted into multiple tokens, it's hard to see how a sentence is being tokenized by the tokenized result. For example:

```python
>>> tokenizer.tokenize('好')
['å¥', '½']  # Two tokens for a word
```

```python
>>> tokenizer.tokenize('你好世界！')
['ä½', 'ł', 'å¥', '½', 'ä¸', 'ĸ', 'çķ', 'Į', 'ï', '¼', 'ģ']
# ⬆️ ???
```

To address this, we define a function that can return full CJK words of a tokenization result, and also how many tokens are used to form that word:

In [None]:
def tokenize_cjk(tokenizer, text):
    tokens = tokenizer.tokenize(text)
    processed_tokens = 0
    i = 1
    tokens_to_form_full_word = []
    full_word = ""
    while i <= len(tokens):
        test_tokens_to_form_full_word = tokens[processed_tokens:i]
        test_full_word = tokenizer.convert_tokens_to_string(test_tokens_to_form_full_word)
        if len(test_full_word) > 1:
            if full_word:
                # We got tokens that should belong to the next word.
                # Yield the previous full word and reset the list.
                yield (full_word, len(tokens_to_form_full_word))
            else:
                # We do not have a previous word, so this might be an English word. Yield it.
                yield (test_full_word, len(test_tokens_to_form_full_word))
                i += 1
            # Reset the list of tokens to form a full word.
            tokens_to_form_full_word = []
            full_word = ""
            # Set processed_tokens to the first token of the next word.
            processed_tokens = i - 1
        else:
            tokens_to_form_full_word = test_tokens_to_form_full_word
            full_word = test_full_word
            i += 1  # Try to add another token to the word on the next iteration.
    # If we have anything left, yield it.
    if full_word:
        yield (full_word, len(tokens_to_form_full_word))

In [None]:
# @markdown Here we got `[('你', 2), ('好', 2), ('，', 3), ('Anna', 1), ('！', 3)]`. It means that thw word "你" is formed by 2 tokens, "好" by 2 tokens and "，" by 3 tokens. "Anna" is formed by 1 token.

tokenizer = \
    AutoTokenizer.from_pretrained("gpt2")
list(tokenize_cjk(tokenizer, "你好，Anna！"))

[('你', 2), ('好', 2), ('，', 3), ('Anna', 1), ('！', 3)]

### Prepare Train Data

In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import itertools
import json
from datasets import load_dataset
# @markdown Load the Wikipedia dataset.
wikipedia_ds = load_dataset(
    # "zetavg/wikipedia_random_page_summaries_zh_tw_5k"
    "zetavg/wikipedia_random_page_summaries_zh_tw_10k"
    # "zetavg/wikipedia_random_page_summaries_zh_tw_100k"
)['train']

wikipedia_ds_count = len(wikipedia_ds)
print("Wikipedia data count: ", wikipedia_ds_count)

def get_wikipedia_page_summaries():
    for batch in wikipedia_ds:
        yield batch['page_summary']

# @markdown Preview it.
first_10_items = list(itertools.islice(
    get_wikipedia_page_summaries(), 10))
print(json.dumps(
    first_10_items, indent=2, ensure_ascii=False))

Downloading readme:   0%|          | 0.00/658 [00:00<?, ?B/s]

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/zetavg___parquet/zetavg--wikipedia_random_page_summaries_zh_tw_10k-bf9678c22fef2612/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/2.93M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/9997 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/zetavg___parquet/zetavg--wikipedia_random_page_summaries_zh_tw_10k-bf9678c22fef2612/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Wikipedia data count:  9997
[
  "TCI（The Children Investment Fund Management），是英國投資家 Chris Hohn 創立的對沖基金。",
  "高元縣，中國舊縣名。\n1958 年 5 月由高邑縣、元氏縣兩縣合併設置高元縣，治所在今河北省元氏縣槐陽鎮。同年 12 月高元縣撤銷，恢復高邑縣、元氏縣。",
  "小行星 9649（英語：9649）是一顆圍繞太陽公轉的小行星。1995 年 12 月 2 日，小林隆男在大泉天文台發現了此天體。\n這顆小行星的絕對星等為 326.1587678462532 等。",
  "紀堯姆・伊曼紐爾・德霍曼 - 克里斯托（法語：Guillaume Emmanuel de Homem-Christo，法語發音：[ɡi manɥɛl də ɔmɛm kʁisto]，1974 年 2 月 8 日 —）或稱作蓋 - 馬努爾・德霍曼 - 克里斯托（法語：Guy-Manuel de Homem-Christo），是一名法國男音樂家、音樂製作人、歌手、詞曲作家、DJ、作曲家和導演，較著名的是和湯瑪斯・本高特共同組成的樂團傻瓜龐克，這使自己和其作品都獲得了廣大的知名度。他和埃里克・謝德維爾（Eric Chédeville）為唱片公司 Crydamoure 的共同創立者，且也和切德維爾組成了名為 Le Knight Club 的團體。\n德霍曼 - 克里斯托擁有兩名孩子。",
  "NGC 1026 是鯨魚座的一個星系。",
  "蓖麻毒蛋白（英語：Ricin）是從蓖麻籽中所萃取出來的一種毒性蛋白質，幾乎對所有的真核細胞都具有殺傷作用。蓖麻毒蛋白的純品是一種白色粉末或結晶體，無味，可溶於稀酸或鹽類，不溶於苯、甲苯、乙醇、乙醚、三氯甲烷等有機溶劑，乾熱時具有良好的穩定性。蓖麻毒蛋白存在多種類型，如結晶型、B - 型、D 型、E 型、T3 型、G 型等，不同類型的蓖麻毒蛋白毒性不盡相同，其中以 D 型的毒性最大。\n此種毒素對人類的平均致死量為 0.2 毫克，但也有一些文獻記載的劑量較高。蓖麻毒蛋白具有糖苷酶活性，作用於真核細胞的核糖體 RNA，使其降解，從而阻止蛋白質合成，導致細胞的死亡，進而對生物體造成傷害。研究顯示，8 顆蓖麻種子的毒素可對一名成人產生毒性。不過在已知紀錄中，因

In [None]:
# @markdown Load vocab from moedict.

import requests
import json
import pandas as pd
from google.colab import data_table
data_table.enable_dataframe_formatter()

moedict_data_cat_response = requests.get(
    "https://raw.githubusercontent.com/g0v/moedict-data/master/dict-cat.json")
moedict_data_cat = json.loads(moedict_data_cat_response.text)
# @markdown Display a table of categories and their sample entries.
cat_and_sample_entries = [
    {
        'name': cat['name'],
     'entries_count': len(cat['entries']),
        'sample_entries': (", ").join(cat['entries'][:10]),
    } for cat in moedict_data_cat]
display(
    pd.DataFrame.from_dict(cat_and_sample_entries, orient='columns'))

Unnamed: 0,name,entries_count,sample_entries
0,成語,3008,"八千里路雲和月, 八子七婿, 拔本塞原, 拔茅連茹, 拔來報往, 拔葵去織, 拔薤, 拔幟易..."
1,諺語,891,"八輩兒五沒根基, 八竿子打不著, 八棍子撂不著, 拔了蘿蔔地皮寬, 撥火又長，拄門又短, 鵓..."
2,歇後語,323,"八仙桌上擺夜壺, 八十年不下雨, 八十歲學吹鼓手, 被窩裡放屁, 包黑臉斷案子, 斑鳩跌彈,..."
3,音譯,484,"扒魯, 巴, 巴波亞, 巴力門, 巴剎, 巴士, 巴茲卡, 巴爾, 吧, 芭蕾舞"
4,義譯,112,"白皮書, 白領, 白領階級, 筆記型電腦, 配接卡, 魔術數字, 模擬器, 媒體, 免費軟體..."
...,...,...,...
61,股票術語,96,"拔檔, 寶塔線, 本利比, 本益比, 崩盤, 丙種經紀人, 補空, 盤面, 盤整, 騙線"
62,大陸用語,633,"芭賽, 把口, 把場, 撥改貸, 掰了, 白班, 白條, 白案, 白衣教練, 白煙"
63,量詞,434,"巴, 巴爾, 撥, 波, 杯, 盃, 倍, 輩, 包, 抱"
64,節氣,34,"八節, 白露, 芒種, 大寒, 大雪, 大暑, 冬至, 太陰曆, 農曆, 立冬"


Select only some categories from moedict data so that we won't have too many words:


In [None]:
selected_categories = [
    '量詞',
    # '成語',
    '音譯', '義譯', '音義合譯',
    '動物名', '植物名',
    # '微生物',
    '稱謂', '職官名',
    # '節氣',
    # '節日',
    '國名', 
    # '朝代名', '帝號',
    '人名',  
    # '種族、民族',
    '地名', '州名', '省名', '城市名', '縣名', '鄉鎮名', '郡名', '島名', '半島名', '群島名',
    '方位名', '山名', '山脈名', '山峰名', '河川名', '湖泊名', '海洋名', '海峽名', '海灣名', '運河名', '水庫名',
    # '星名',
    '星座名',
    # '書名', '書體名', '文體名',
    # '詩名', '詞牌名', '曲牌名', 
    '樂曲名', '樂器名', 
    # '戲劇曲藝', '雜劇', '傳奇',
    '舞曲舞蹈', '球類',
    # '神話',
    # '武器名',
    # '病名',
    # '股票術語',
]
filtered_moedict_data_cat = [
    cat for cat in moedict_data_cat
    if cat['name'] in selected_categories]
print("Selected categories: ", len(filtered_moedict_data_cat))
filtered_moedict_entries = [
    entry
    for cat in filtered_moedict_data_cat
    for entry in cat['entries']]
print("Entries: ", len(filtered_moedict_entries))

Selected categories:  36
Entries:  10412


In [None]:
# @markdown Mix moedict entries and random wikipedia page summaries for training.
import itertools
import random

word_connectors = ["和", "的", "及", "以及", "與", "或", "或者", "跟", "既", "又", "還", "還有",
                   "而", "而且", "同", "並", "並且", "即", "就", "總之", "因此", "如", "若", "若是", "假若", "假如"]
# cycling_word_connectors = itertools.cycle(word_connectors)

def get_training_words_text():
    # return ("").join([
    #     f"{word}{connector}" for word, connector
    #     in list(
    #         zip(filtered_moedict_entries, cycling_word_connectors)
    #     )])
    return ("").join([
        f"{word}{random.choice(word_connectors)}" for word
        in filtered_moedict_entries])

print("Sample training words text: ", get_training_words_text()[:100])
print("Sample wikipedia page_summary: ", wikipedia_ds[0]['page_summary'])

# word_text_iterations = int(wikipedia_ds_count / 20)
word_text_iterations = 100

def get_training_corpus():
    for _ in range(word_text_iterations):
        yield get_training_words_text()
    for batch in wikipedia_ds:
        yield batch['page_summary']

training_corpus_count = word_text_iterations + wikipedia_ds_count
print("Training corpus count: ", training_corpus_count)

Sample training words text:  扒魯即巴或巴波亞或者巴力門或巴剎或者巴士與巴茲卡還巴爾同吧同芭蕾舞即波特酒若是波雷羅舞曲又波羅蜜跟波羅提木叉以及波羅夷或者波爾卡舞而般若的泊車總之柏青哥假若勃露斯假若白蘭地與百靈舌假如百事可樂又拜拜的
Sample wikipedia page_summary:  TCI（The Children Investment Fund Management），是英國投資家 Chris Hohn 創立的對沖基金。
Training corpus count:  10097


### Prepare Tokenizer and Train

In [None]:
# @markdown Prepare tokenizer for training
old_tokenizer = \
    AutoTokenizer.from_pretrained('EleutherAI/gpt-j-6b')
old_tokenizer

GPT2TokenizerFast(name_or_path='EleutherAI/gpt-j-6b', vocab_size=50257, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=True)

In [None]:
# @markdown Check how the tokenizer behaves.
import json
example_text = "網際網路（英語：Internet）是指 20 世紀末期興起電腦網路與電腦網路之間所串連成的龐大網路系統。"
# @markdown This is not an efficient tokenizer for Chinese words, as most of the word are broken into 2 ~ 3 tokens. The phrases "網際網路" occupies 11 tokens in total.
print(list(tokenize_cjk(old_tokenizer, example_text)))

[('網', 3), ('際', 2), ('網', 3), ('路', 3), ('（', 3), ('英', 3), ('語', 2), ('：', 3), ('Internet', 1), ('）', 3), ('是', 1), ('指', 3), (' 20', 1), (' ', 1), ('世', 2), ('紀', 3), ('末', 2), ('期', 2), ('�', 1), ('��', 1), ('�', 1), ('�', 1), ('電', 2), ('腦', 3), ('網', 3), ('路', 3), ('與', 2), ('電', 2), ('腦', 3), ('網', 3), ('路', 3), ('之', 1), ('間', 2), ('所', 2), ('串', 2), ('連', 2), ('成', 2), ('的', 1), ('龐', 2), ('大', 1), ('網', 3), ('路', 3), ('系', 3), ('統', 3), ('。', 1)]


In [None]:
print("old_tokenizer.vocab_size: ", old_tokenizer.vocab_size)

old_tokenizer.vocab_size:  50257


In [None]:
vocabs_to_add = 20000

# @markdown Train a new tokenizer with `zh-tw` corpus.
new_trained_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(), 
    vocab_size=vocabs_to_add,
    length=training_corpus_count
)

In [None]:
# @markdown Merge the new trained tokenizer with the original tokenizer to form a new tokenizer.
import os
import shutil

old_tokenizer.save_pretrained('/tmp/old_tokenizer')
new_trained_tokenizer.save_pretrained('/tmp/new_trained_tokenizer')
os.makedirs("/tmp/merged_tokenizer", exist_ok=True)

shutil.copy("/tmp/old_tokenizer/tokenizer_config.json", "/tmp/merged_tokenizer/tokenizer_config.json")
shutil.copy("/tmp/old_tokenizer/special_tokens_map.json", "/tmp/merged_tokenizer/special_tokens_map.json")
# ValueError: Non-consecutive added token '<|extratoken_1|>' found. Should have index 51693 but has index 50257 in saved vocabulary.
# shutil.copy("/tmp/old_tokenizer/added_tokens.json", "/tmp/merged_tokenizer/added_tokens.json")

vocab = json.load(open('/tmp/old_tokenizer/vocab.json'))
new_vocab = json.load(open('/tmp/new_trained_tokenizer/vocab.json'))
idx = max(list(vocab.values())) + 1
for word in new_vocab.keys():
    if word not in vocab.keys():
        vocab[word] = idx
        idx += 1

with open('/tmp/merged_tokenizer/vocab.json', 'w') as f:
    json.dump(vocab, f, ensure_ascii=False)
print("New vocab size: ", len(vocab.values()))

with open('/tmp/old_tokenizer/merges.txt', 'r') as original_merges,\
     open('/tmp/new_trained_tokenizer/merges.txt', 'r') as new_merges,\
     open('/tmp/merged_tokenizer/merges.txt', 'w') as output_merges:

    output_merges.write(original_merges.read())
    lines = new_merges.readlines()[1:]
    output_merges.writelines(lines)

new_tokenizer = AutoTokenizer.from_pretrained('/tmp/merged_tokenizer')
new_tokenizer

New vocab size:  68965


GPT2TokenizerFast(name_or_path='/tmp/merged_tokenizer', vocab_size=68965, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=True)

### Results

In [None]:
import pandas as pd

print("Old vocab size: ", old_tokenizer.vocab_size)
print("New vocab size: ", new_tokenizer.vocab_size)

old_vocab = old_tokenizer.vocab
new_vocab = new_tokenizer.vocab
modified_vocab_table = []
for key, value in old_vocab.items():
    if key not in new_vocab:
        modified_vocab_table.append([key, value, None])
    elif value != new_vocab[key]:
        modified_vocab_table.append([key, value, new_vocab[key]])
modified_vocab_count = len(modified_vocab_table)
if modified_vocab_count:
    print(f"{modified_vocab_count} modified vocabs.")
    # print(json.dumps(modified_vocab_table, indent=2))
    df = pd.DataFrame(modified_vocab_table, columns=['Vocab', 'Original ID', 'New ID'])
    display(df)

Old vocab size:  50257
New vocab size:  68965
143 modified vocabs.


Unnamed: 0,Vocab,Original ID,New ID
0,<|extratoken_122|>,50378,
1,<|extratoken_77|>,50333,
2,<|extratoken_17|>,50273,
3,<|extratoken_43|>,50299,
4,<|extratoken_32|>,50288,
...,...,...,...
138,<|extratoken_86|>,50342,
139,<|extratoken_114|>,50370,
140,<|extratoken_128|>,50384,
141,<|extratoken_84|>,50340,


In [None]:
# @markdown Compare old and new trained tokenizer. After reading thousands of zh-tw wiki summaries, the trained tokenizer performs much better.

print("new_trained_tokenizer.vocab_size: ", new_trained_tokenizer.vocab_size)
print("example_text: ", example_text)
print("old: ",
      list(tokenize_cjk(old_tokenizer, example_text)))
print("new: ",
      list(tokenize_cjk(new_tokenizer, example_text)))

new_trained_tokenizer.vocab_size:  20000
example_text:  網際網路（英語：Internet）是指 20 世紀末期興起電腦網路與電腦網路之間所串連成的龐大網路系統。
old:  [('網', 3), ('際', 2), ('網', 3), ('路', 3), ('（', 3), ('英', 3), ('語', 2), ('：', 3), ('Internet', 1), ('）', 3), ('是', 1), ('指', 3), (' 20', 1), (' ', 1), ('世', 2), ('紀', 3), ('末', 2), ('期', 2), ('�', 1), ('��', 1), ('�', 1), ('�', 1), ('電', 2), ('腦', 3), ('網', 3), ('路', 3), ('與', 2), ('電', 2), ('腦', 3), ('網', 3), ('路', 3), ('之', 1), ('間', 2), ('所', 2), ('串', 2), ('連', 2), ('成', 2), ('的', 1), ('龐', 2), ('大', 1), ('網', 3), ('路', 3), ('系', 3), ('統', 3), ('。', 1)]
new:  [('網際網路', 1), ('（', 1), ('英語', 1), ('：', 1), ('Internet', 1), ('）', 1), ('是指', 1), (' 20', 1), (' 世紀', 1), ('末期', 1), ('�', 1), ('��', 1), ('�', 1), ('�', 1), ('電腦', 1), ('網路', 1), ('與', 1), ('電腦', 1), ('網路', 1), ('之間', 1), ('所', 1), ('串', 1), ('連', 1), ('成的', 1), ('龐', 1), ('大', 1), ('網路', 1), ('系統', 1), ('。', 1)]


In [None]:
# @markdown Compare old and new trained tokenizer. After reading thousands of zh-tw wiki summaries, the trained tokenizer performs much better.

sample_text_list = [
    "人工智慧是電腦科學、心理學、哲學等學科融合的跨領域學科。",
    "高雄市充滿藝術氣息與海港風情，擁有獨具特色的駁二藝術特區、充滿藝術氛圍的美術館、現代化的高雄流行音樂中心、大型展覽館與會議中心、以及壯觀的高雄港等多元旅遊景點。透過便捷的輕軌與大眾運輸工具，到訪的旅客可以輕鬆地往返這些地點，體驗高雄豐富的文化與歷史並享受美好的時光。"
    "程式設計師們越來越依賴 Git 進行版本控制、使用 Python、Java 或 JavaScript 等程式語言開發 Web 應用程式，還需要在 Linux 或 Windows 作業系統上操作，並熟悉各種資料庫系統如 MySQL、MongoDB 和 PostgreSQL，以及對 API、RESTful 架構和 Docker 容器化技術有深入了解，這都是為了追求在軟體開發領域的卓越表現。",
    "在機器學習領域，研究人員利用各種算法如 SVM、Random Forest 和 Neural Networks 來分析大量數據，並對應用如自然語言處理（NLP）、圖像識別（Image Recognition）以及強化學習（Reinforcement Learning）進行深入研究，同時，他們也需要掌握 TensorFlow、PyTorch 等深度學習框架，以實現更為高效、準確的模型訓練和預測，以期在人工智能（AI）領域取得突破性的成果。",
    "過幾天天天天氣不好。",
]

for text in sample_text_list:
    print("sample_text: ", text)
    print("old: ",
          list(tokenize_cjk(old_tokenizer, text)))
    print("new: ",
          list(tokenize_cjk(new_tokenizer, text)))
    print()

sample_text:  人工智慧是電腦科學、心理學、哲學等學科融合的跨領域學科。
old:  [('人', 1), ('工', 2), ('智', 3), ('慧', 3), ('是', 1), ('電', 2), ('腦', 3), ('科', 3), ('學', 2), ('、', 1), ('心', 2), ('理', 2), ('學', 2), ('、', 1), ('哲', 3), ('學', 2), ('等', 3), ('學', 2), ('科', 3), ('融', 3), ('合', 2), ('的', 1), ('跨', 3), ('領', 3), ('域', 3), ('學', 2), ('科', 3), ('。', 1)]
new:  [('人工智慧', 1), ('是', 1), ('電腦', 1), ('科學', 1), ('、', 1), ('心理', 1), ('學', 1), ('、', 1), ('哲學', 1), ('等', 1), ('學科', 1), ('融', 1), ('合', 1), ('的', 1), ('跨', 1), ('領域', 1), ('學科', 1), ('。', 1)]

sample_text:  高雄市充滿藝術氣息與海港風情，擁有獨具特色的駁二藝術特區、充滿藝術氛圍的美術館、現代化的高雄流行音樂中心、大型展覽館與會議中心、以及壯觀的高雄港等多元旅遊景點。透過便捷的輕軌與大眾運輸工具，到訪的旅客可以輕鬆地往返這些地點，體驗高雄豐富的文化與歷史並享受美好的時光。程式設計師們越來越依賴 Git 進行版本控制、使用 Python、Java 或 JavaScript 等程式語言開發 Web 應用程式，還需要在 Linux 或 Windows 作業系統上操作，並熟悉各種資料庫系統如 MySQL、MongoDB 和 PostgreSQL，以及對 API、RESTful 架構和 Docker 容器化技術有深入了解，這都是為了追求在軟體開發領域的卓越表現。
old:  [('高', 2), ('雄', 2), ('市', 2), ('充', 2), ('滿', 3), ('藝', 3), ('術', 2), ('氣', 2), ('息', 3), ('與', 2), ('海', 2), ('港', 3), ('風

In [None]:
# @markdown For generating random sample text
!pip install git+https://github.com/zetavg/python_wikipedia pangu
import wikipedia
import pangu
wikipedia.set_lang("zh-tw")


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/zetavg/python_wikipedia
  Cloning https://github.com/zetavg/python_wikipedia to /tmp/pip-req-build-keqzyuk6
  Running command git clone --filter=blob:none --quiet https://github.com/zetavg/python_wikipedia /tmp/pip-req-build-keqzyuk6
  Resolved https://github.com/zetavg/python_wikipedia to commit 2e9c2bfc63217af36cf7355e0c157a63c4738a2e
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pangu
  Downloading pangu-4.0.6.1-py3-none-any.whl (6.4 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11802 sha256=9cff5c34610009ac9ada049d05ca573df14a2ff0e3f19c8b903d77ad61c0f679
  Stored in directory: /tmp/pip-ephem-wheel-cache-vav8k9ci/wheels/9b/d3/d8/2d3cc48d53fdd151a7ddbe0bf6f56c91bce355ef96777fc6d1
Successfu

In [None]:
# @markdown More random examples

for i in range(0, 20):
    page = wikipedia.random()
    try:
        text = pangu.spacing_text(
            wikipedia.summary(page, sentences=1)
        )
        if not example_text:
            continue
        print("example_text: ", text)
        print("old: ",
              list(tokenize_cjk(old_tokenizer, text)))
        print("new: ",
              list(tokenize_cjk(new_tokenizer, text)))
        print("")
    except:
        pass

example_text:  大青山戰鬥遺址，位於山東省臨沂市沂南縣沂南縣、費縣交界區，是中華人民共和國山東省文物保護單位之一。
old:  [('大', 1), ('青', 3), ('山', 3), ('戰', 2), ('鬥', 3), ('遺', 2), ('址', 3), ('，', 3), ('位', 2), ('於', 2), ('山', 3), ('東', 2), ('省', 2), ('臨', 3), ('沂', 3), ('市', 2), ('沂', 3), ('南', 2), ('縣', 3), ('沂', 3), ('南', 2), ('縣', 3), ('、', 1), ('費', 3), ('縣', 3), ('交', 2), ('界', 2), ('區', 2), ('，', 3), ('是', 1), ('中', 1), ('華', 3), ('人', 1), ('民', 2), ('共', 2), ('和', 3), ('國', 2), ('山', 3), ('東', 2), ('省', 2), ('文', 2), ('物', 2), ('保', 2), ('護', 2), ('單', 3), ('位', 2), ('之', 1), ('一', 1), ('。', 1)]
new:  [('大', 1), ('青', 1), ('山', 1), ('戰鬥', 1), ('遺址', 1), ('，', 1), ('位於', 1), ('山東省', 1), ('臨', 1), ('沂', 1), ('市', 1), ('沂', 1), ('南', 1), ('縣', 1), ('沂', 1), ('南', 1), ('縣', 1), ('、', 1), ('費', 1), ('縣', 1), ('交界', 1), ('區', 1), ('，', 1), ('是中華人民共和國', 1), ('山東省', 1), ('文物', 1), ('保', 1), ('護', 2), ('單位', 1), ('之一', 1), ('。', 1)]

example_text:  十方控股有限公司（英語：ShiFang Holding Limited），（港交所：1831），是一所綜合性印刷電子媒體廣告服務供應商，公司地址位於中國福建省福州市鼓樓區東街

### Save the Tokenizer

In [None]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
new_tokenizer.push_to_hub(
    "test-zh-tw-tokenizer-20230427-2",
    private=True
)

ValueError: ignored