# Training a new tokenizer from an old one

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


You will need to set up git; adapt your email and name in the following cell.

In [2]:
!git config --global user.email "417846472@example.com"
!git config --global user.name "splendidsummer"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [3]:
from huggingface_hub import notebook_login
notebook_login()

In [4]:
# pip install transformers datasets tokenizers
from transformers import AutoTokenizer
from datasets import load_dataset

# 1) Prepare a corpus iterator (ideally over your own domain text).
# This example uses wikitext; for a small project you can instead read a local
# .txt file into a list of lines.
def corpus_iterator():
    ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")  # swap in your own data
    for ex in ds:
        text = ex["text"].strip()
        if text:  # skip empty lines
            yield text
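
The comment in the cell above suggests reading a local text file instead of wikitext. A minimal sketch of that idea follows, assuming a UTF-8 file with one passage per line. The names `local_corpus_iterator`, `batch_size`, and the demo file content are hypothetical and not part of the notebook. It yields lists of lines rather than single lines, since `train_new_from_iterator` is documented to accept batches of texts.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List


def local_corpus_iterator(path: str, batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield batches of non-empty, stripped lines from a UTF-8 text file."""
    batch: List[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.strip()
            if not text:
                continue  # skip blank lines, as corpus_iterator() does
            batch.append(text)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly shorter batch


# Tiny demo on a temporary file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "corpus.txt"
    demo_path.write_text(
        "BRCA1 is a gene.\n\nEGFR-TKI is a drug class.\nHER2 status matters.\n",
        encoding="utf-8",
    )
    batches = list(local_corpus_iterator(str(demo_path), batch_size=2))
    print(batches)
```

To use it for training, pass `local_corpus_iterator("my_corpus.txt")` in place of `corpus_iterator()`; the file name here is only a placeholder.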

In [5]:
# TODO: check how to decide the vocab size
vocab_size = 52000
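
One input to the vocabulary-size question is model cost: every vocabulary entry adds one row to the input embedding matrix of whatever model later uses the tokenizer. The sketch below is only back-of-the-envelope arithmetic. The hidden size of 768 is an assumption (the width of the smallest GPT-2 model), and the candidate sizes other than 52000 are illustrative, not recommendations.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 768  # assumption: hidden width of the smallest GPT-2 model

for candidate in (32_000, 52_000, 64_000):  # illustrative candidate sizes
    params = embedding_params(candidate, HIDDEN_SIZE)
    print(f"vocab_size={candidate:>6}: {params / 1e6:.1f}M embedding parameters")
```

A larger vocabulary costs more parameters but usually yields shorter token sequences, so the two need to be weighed against each other on your own corpus.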

In [6]:
# 2) Choose a "base" tokenizer (its tokenization style and rules are kept).
#    You can swap in the tokenizer of your own model, e.g. "meta-llama/Llama-3.1-8B" (requires access).
base_tokenizer_id = "gpt2"  # "bert-base-uncased", "bigcode/starcoder2-tokenizer", etc. also work
base_tok = AutoTokenizer.from_pretrained(base_tokenizer_id, use_fast=True)

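# 3) Decide the new vocabulary size and the special tokens (adjust as needed).
# Open question: how should vocab_size be chosen when retraining the tokenizer?

# Fall back to conventional names for special tokens the base tokenizer lacks.
# dict.fromkeys removes duplicates while keeping a deterministic order
# (a plain set would reorder the tokens from run to run).
special_tokens = list(dict.fromkeys([
    base_tok.unk_token or "<unk>",
    base_tok.pad_token or "<pad>",
    base_tok.bos_token or "<s>",
    base_tok.eos_token or "</s>"
]))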

# 4) Retrain on the corpus to get a new tokenizer (still a fast tokenizer underneath)
new_tok = base_tok.train_new_from_iterator(
    corpus_iterator(),
    vocab_size=vocab_size,
    new_special_tokens=special_tokens,
)

# 5) Save (the result can be used directly with transformers models)
save_dir = "tokenizer_new_from_base"
new_tok.save_pretrained(save_dir)
print(f"Saved to: {save_dir}")

# 6) Quick test: check whether the new tokenizer splits domain terms more sensibly
sample = "We fine-tune a tokenizer for bio-medical entities like BRCA1 and EGFR-TKI."
print(new_tok.tokenize(sample))


Saved to: tokenizer_new_from_base
['We', 'Ġfine', '-', 't', 'une', 'Ġa', 'Ġtoken', 'izer', 'Ġfor', 'Ġbio', '-', 'med', 'ical', 'Ġentities', 'Ġlike', 'ĠBR', 'CA', '1', 'Ġand', 'ĠE', 'G', 'FR', '-', 'T', 'K', 'I', '.']


In [7]:
# Register key domain terms as whole tokens that are never split further.
# add_tokens returns the number of tokens actually added.
from tokenizers import AddedToken
domain_terms = [
    AddedToken("BRCA1", single_word=True),
    AddedToken("EGFR", single_word=True),
    AddedToken("EGFR-TKI", single_word=True),
    AddedToken("PD-1", single_word=True),
    AddedToken("HER2", single_word=True),
]
new_tok.add_tokens(domain_terms)

5

In [8]:
print(new_tok.tokenize(sample))

['We', 'Ġfine', '-', 't', 'une', 'Ġa', 'Ġtoken', 'izer', 'Ġfor', 'Ġbio', '-', 'med', 'ical', 'Ġentities', 'Ġlike', 'Ġ', 'BRCA1', 'Ġand', 'Ġ', 'EGFR-TKI', '.']
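
To put a number on the difference between the two outputs above, one simple measure is tokens per whitespace-separated word (lower means less fragmentation). The sketch below copies the two token lists from the printed outputs. `tokens_per_word` is a hypothetical helper, and a single sentence is far too little data to draw conclusions from.

```python
sample = "We fine-tune a tokenizer for bio-medical entities like BRCA1 and EGFR-TKI."

# Token lists copied from the two printed outputs above.
tokens_before = ['We', 'Ġfine', '-', 't', 'une', 'Ġa', 'Ġtoken', 'izer', 'Ġfor',
                 'Ġbio', '-', 'med', 'ical', 'Ġentities', 'Ġlike', 'ĠBR', 'CA', '1',
                 'Ġand', 'ĠE', 'G', 'FR', '-', 'T', 'K', 'I', '.']
tokens_after = ['We', 'Ġfine', '-', 't', 'une', 'Ġa', 'Ġtoken', 'izer', 'Ġfor',
                'Ġbio', '-', 'med', 'ical', 'Ġentities', 'Ġlike', 'Ġ', 'BRCA1',
                'Ġand', 'Ġ', 'EGFR-TKI', '.']


def tokens_per_word(tokens, text):
    """Average number of tokens per whitespace-separated word in `text`."""
    return len(tokens) / len(text.split())


for label, tokens in (("before add_tokens", tokens_before),
                      ("after add_tokens", tokens_after)):
    print(f"{label}: {len(tokens)} tokens, "
          f"{tokens_per_word(tokens, sample):.2f} tokens per word")
```

On a real evaluation you would pass `new_tok.tokenize(text)` for many held-out domain sentences and average the ratio.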


In [None]:
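from tokenizers import Tokenizer, AddedToken
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Metaspace
from transformers import PreTrainedTokenizerFast

# Alternative: build a Unigram tokenizer from scratch instead of retraining GPT-2's.
base_tok = Tokenizer(Unigram())
base_tok.normalizer = NFKC()  # To be checked
# Metaspace only turns spaces into the visible prefix "▁" and does not split on
# punctuation, which helps the model learn hyphenated terms as single pieces.
base_tok.pre_tokenizer = Metaspace(replacement="▁")

trainer = UnigramTrainer(
    vocab_size=64000,
    special_tokens=["<unk>", "<pad>", "<s>", "</s>"],
    unk_token="<unk>",  # map unknown pieces to <unk>
)

# A raw tokenizers.Tokenizer is trained in place with train_from_iterator;
# train_new_from_iterator exists only on transformers fast tokenizers.
base_tok.train_from_iterator(corpus_iterator(), trainer=trainer)

# Wrap the result so that .tokenize() and .save_pretrained() are available below.
new_tok = PreTrainedTokenizerFast(
    tokenizer_object=base_tok,
    unk_token="<unk>", pad_token="<pad>", bos_token="<s>", eos_token="</s>",
)

# # Explicitly register key terms as whole tokens that are never split further
# domain_terms = [AddedToken("EGFR-TKI", single_word=True),
#                 AddedToken("BRCA1", single_word=True),
#                 AddedToken("HER2", single_word=True),
#                 AddedToken("PD-1", single_word=True)]
# new_tok.add_tokens(domain_terms)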




In [None]:
# 6) Quick test: check whether the new tokenizer splits domain terms more sensibly
sample = "We fine-tune a tokenizer for bio-medical entities like BRCA1 and EGFR-TKI."
print(new_tok.tokenize(sample))


In [None]:
# 5) Save (the result can be used directly with transformers models)
save_dir = "tokenizer_new_from_base"
new_tok.save_pretrained(save_dir)
print(f"Saved to: {save_dir}")