# Training a Custom Tokenizer for StackOverflow QA Pairs (C Language)

This notebook demonstrates how to train your own tokenizer using Hugging Face's Transformers and Tokenizers libraries. Custom tokenizers are useful when you want your language model to better understand the specific vocabulary and structure of your dataset. Here, we focus on StackOverflow QA pairs related to the C programming language.

We'll explore different tokenization algorithms and show how to build and train them step by step.

## Why Train Your Own Tokenizer?

Most transformer models use subword tokenization algorithms, which need to be trained to recognize common patterns in your data. A tokenizer trained on your own dataset is more likely to keep domain-specific terms (for C, identifiers such as `malloc` or `strtok`) as single tokens instead of splitting them into many fragments. That shortens sequences and can lead to better results when training or fine-tuning language models.

For more details, you can check out the Hugging Face course <a href="https://huggingface.co/learn/llm-course/en/chapter2/4">chapter</a> on tokenization.

### Preparing the Data

In [60]:
from datasets import load_dataset

dataset = load_dataset("Mxode/StackOverflow-QA-C-Language-40k", trust_remote_code=True)    
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 40649
    })
})

In [61]:
print(dataset["train"][:5]) 

{'question': ['\nConsider this implementation ofstrtok()in C\n\n```\nchar *pt;\npt = strtok(line, ":");\nif (pt != NULL)\n{\n    pt = strtok(NULL, ":");\n}\n```\n\nwhy doesn\'tpthave to be explicitly allocated memory? Likept[128]orpt = malloc(...)? I would have thought the above implementation would segfault. Do I have to worry about deallocation?\n', '\nI am trying to extract two last characters from achararray and has been unable to do so. The array will be given to me and I don\'t have control over it. I only know the lastnpositions of the array are digits. In the following casen=2:\n\n```\nchar c[5] = "xyz45"\nchar d[2];\nd[0] = c[3];\nd[1] = c[4];\nint e = atoi(d);\n```\n\nClearly the value oferequired is45. Any way to solve this? (The above approach is a representative way about how one might go about doing this in python. I am looking for an elegant way to do this.)\n', '\nI am trying to figure out this issue:\nI tried to run this two easy codes\n\n```\nint main(void) {\n    whi

In [66]:
print(dataset['train'][0]['question'])


Consider this implementation ofstrtok()in C

```
char *pt;
pt = strtok(line, ":");
if (pt != NULL)
{
    pt = strtok(NULL, ":");
}
```

why doesn'tpthave to be explicitly allocated memory? Likept[128]orpt = malloc(...)? I would have thought the above implementation would segfault. Do I have to worry about deallocation?



In [68]:
print(dataset['train'][0]['answer'])


linehas to reference modifiablechararray and moststrtokimplementations are using this memory.

It is the reason why you do not have to provide any additional memory for this operation.

Remember thatlinewill be modified (destroyed) during this operation.

ptwill hold (if notNULL) the reference to one of the elements of the array referenced byline



In [84]:
dataset = dataset['train']

### Efficient Data Loading for Training

When working with large datasets, it's best to process data in batches rather than loading everything into memory. We use a Python generator to yield batches of texts, which makes training more memory-efficient and scalable.

In [85]:
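batch_size = 1000

def batch_generator():
    for i in range(0, len(dataset), batch_size):
        # Slice the rows first, then pick each column, so that only one
        # batch (not the whole column) is materialized per iteration.
        batch = dataset[i: i + batch_size]
        for k in ['question', 'answer']:
            yield batch[k]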
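
The batching pattern can be tried without downloading anything. The following is a minimal sketch that walks a small in-memory list of rows; the `rows` data and the `iter_text_batches` name are illustrative assumptions, not part of the notebook or the `datasets` API.

```python
def iter_text_batches(rows, columns, batch_size):
    """Yield one list of texts per column for each consecutive slice of rows."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for column in columns:
            yield [row[column] for row in batch]


# Hypothetical toy data standing in for the QA dataset.
rows = [
    {"question": "q0", "answer": "a0"},
    {"question": "q1", "answer": "a1"},
    {"question": "q2", "answer": "a2"},
]

batches = list(iter_text_batches(rows, ["question", "answer"], batch_size=2))
print(batches)
# [['q0', 'q1'], ['a0', 'a1'], ['q2'], ['a2']]
```

Because the function is a generator, each batch is built only when the consumer asks for it, which is what keeps memory use bounded on a large corpus.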

In [70]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

### Training a Tokenizer Based on an Existing Model

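If you want your tokenizer to use the same algorithm and parameters as a popular model (like Llama or GPT-2), you can start from that model and retrain its tokenizer on your own data. This is a quick way to adapt a proven tokenization strategy to your specific domain. Retraining relies on `train_new_from_iterator`, which is only available on fast (Rust-backed) tokenizers, so we check `is_fast` first.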

In [71]:
tokenizer.is_fast

True

In [13]:
tokenizer.vocab_size

128000

In [16]:
new_tokenizer = tokenizer.train_new_from_iterator(batch_generator(), vocab_size=30000)

In [75]:
print(new_tokenizer('This is a demo string copied from stackoverflow! 😊'))

{'input_ids': [0, 204, 10, 8, 5079, 100, 2220, 56, 10513, 434, 5, 2, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [76]:
new_tokenizer.vocab_size

30000

In [77]:
new_tokenizer.get_vocab()

{'wich': 17351,
 '⊕': 29716,
 '▁pclose(fp);\n': 24087,
 '▁arg2': 10136,
 'fnct': 22572,
 'RHS': 15949,
 'BOOL': 10450,
 '▁<string.h>\n': 2230,
 ');\n```\n\nI': 1181,
 '▁GET_RED_TEXT(': 27571,
 '-E': 11604,
 "getenv('PWD')": 23793,
 '▁decides': 12434,
 'Copie': 24988,
 '▁want': 54,
 '▁encoder': 19737,
 '▁rhs': 22473,
 'spaces': 13497,
 'immediately': 15186,
 '▁messages': 1764,
 ';\nstruct': 3046,
 'Bytes': 6269,
 'truct/union/enum': 28798,
 'room_dim': 22322,
 '?\n\nP.S.': 14381,
 '▁__cplusplus\n': 12025,
 '▁main.o': 8315,
 '▁below.': 5913,
 '13,': 12834,
 'introspection': 22213,
 '▁thread-local': 20888,
 '▁<string.h>\n\nint': 3303,
 '▁<errno.h>\n': 12415,
 '▁strdup(': 8091,
 'Conversely': 24713,
 'g_signal_': 23016,
 '▁based': 944,
 'STM32Cube': 25120,
 ':\n\n```\nchar*': 3260,
 '1)\n{\n': 11050,
 'feof(fp)': 24271,
 ',i,j,k,': 24124,
 ';j++)\n': 7409,
 'hal': 13810,
 '0x4': 17077,
 '(4,': 14593,
 'PCRE': 14471,
 '▁Erlang/Elixir': 28266,
 '▁five': 6741,
 "▁'one": 15096,
 '▁arrays': 635

### Saving and Sharing Your Tokenizer

Once your tokenizer is trained, you can save it locally for future use or push it to the Hugging Face Hub to share with others. This makes it easy to reload your tokenizer or use it in other projects.

In [20]:
new_tokenizer.save_pretrained("toks/llama3-stackoverflow")

('toks/llama3-stackoverflow/tokenizer_config.json',
 'toks/llama3-stackoverflow/special_tokens_map.json',
 'toks/llama3-stackoverflow/tokenizer.json')

In [None]:
from huggingface_hub import login

login(token="<YOUR HF TOKEN WITH WRITE PERMISSIONS>")

In [49]:
new_tokenizer.push_to_hub("llama3-stackoverflow-QA-C-language-40k")

CommitInfo(commit_url='https://huggingface.co/sinsankio/llama3-stackoverflow-QA-C-language-40k/commit/b10d0331d5146b43a24d475b71d88b0b74f89acd', commit_message='Upload tokenizer', commit_description='', oid='b10d0331d5146b43a24d475b71d88b0b74f89acd', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sinsankio/llama3-stackoverflow-QA-C-language-40k', endpoint='https://huggingface.co', repo_type='model', repo_id='sinsankio/llama3-stackoverflow-QA-C-language-40k'), pr_revision=None, pr_num=None)

### Building a Tokenizer from Scratch

If you want full control over the tokenization process, you can build a tokenizer step by step using the Tokenizers library. This lets you choose the normalization, pre-tokenization, model type, post-processing, and decoding methods that best fit your data and use case.

Below, we show how to build WordPiece, BPE, and Unigram tokenizers, similar to those used in BERT, GPT-2, and ALBERT, respectively.

#### Steps to Build a Tokenizer Pipeline

A tokenizer pipeline usually includes:
- **Normalization**: Clean and standardize text (e.g., lowercasing, removing accents).
- **Pre-tokenization**: Split text into word-level pieces (e.g., on whitespace and punctuation); learned subwords never cross these boundaries.
- **Model**: Learn subword units from your data.
- **Post-processing**: Add special tokens for model compatibility.
- **Decoding**: Convert tokens back to text.

You can mix and match these steps to create a tokenizer that fits your needs.
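
To make the order of the stages concrete, here is a toy pipeline in plain Python. It is only a sketch: every function name is hypothetical, and the "model" stage is a whole-word vocabulary lookup standing in for a real subword model.

```python
import unicodedata


def normalize(text):
    """Lowercase and strip accents (decompose, then drop combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def pre_tokenize(text):
    """Split on whitespace only (a deliberately simple assumption)."""
    return text.split()


def model(words, vocab):
    """Stand-in for the subword model: unknown words map to [UNK]."""
    return [word if word in vocab else "[UNK]" for word in words]


def post_process(tokens):
    """Wrap the sequence in the special tokens a BERT-style model expects."""
    return ["[CLS]", *tokens, "[SEP]"]


def decode(tokens):
    """Drop special tokens and join the rest back into text."""
    return " ".join(t for t in tokens if t not in {"[CLS]", "[SEP]"})


def run_pipeline(text, vocab):
    return post_process(model(pre_tokenize(normalize(text)), vocab))


tokens = run_pipeline("Café malloc", vocab={"cafe", "malloc"})
print(tokens)          # ['[CLS]', 'cafe', 'malloc', '[SEP]']
print(decode(tokens))  # cafe malloc
```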

In [78]:
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

In [79]:
tokenizer.normalizer = normalizers.BertNormalizer(clean_text=True, strip_accents=True)

In [80]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

In [81]:
tokenizer.pre_tokenizer.pre_tokenize_str('This is a demo string copied from stackoverflow! 😊')

[('This', (0, 4)),
 ('is', (5, 7)),
 ('a', (8, 9)),
 ('demo', (10, 14)),
 ('string', (15, 21)),
 ('copied', (22, 28)),
 ('from', (29, 33)),
 ('stackoverflow', (34, 47)),
 ('!', (47, 48)),
 ('😊', (49, 50))]
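
The offsets in this output are character positions in the original string. A rough approximation of this behaviour with a regular expression is sketched below; `bert_like_pre_tokenize` is a hypothetical helper, and it differs from `BertPreTokenizer` in edge cases (for example, `\w` keeps underscores inside a word).

```python
import re


def bert_like_pre_tokenize(text):
    """Return (piece, (start, end)) pairs: runs of word characters,
    plus every other non-space character as its own piece."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


pieces = bert_like_pre_tokenize("This is a demo string copied from stackoverflow! 😊")
print(pieces[-3:])
# [('stackoverflow', (34, 47)), ('!', (47, 48)), ('😊', (49, 50))]
```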

In [82]:
special_tokens = ['[UNK]', '[PAD]', '[CLS]', '[SEP]', '[MASK]']
trainer = trainers.WordPieceTrainer(vocab_size=30_000, special_tokens=special_tokens)

In [86]:
tokenizer.train_from_iterator(batch_generator(), trainer=trainer)






In [87]:
encoding = tokenizer.encode_batch(["Hello, y'all! How are you 😁 ?", 'Huggingface made it easy!'])

In [88]:
encoding[0].tokens

['hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?']

In [89]:
encoding[1].tokens

['hug', '##ging', '##face', 'made', 'it', 'easy', '!']
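
The `##` prefix marks a piece that continues the current word. WordPiece encoding is commonly described as a greedy longest-match-first search over the vocabulary; the sketch below illustrates that idea with a tiny hand-written vocabulary. The `wordpiece_encode` helper and `toy_vocab` are assumptions for illustration, not the library's implementation.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedily take the longest vocabulary piece at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary; a trained tokenizer would hold about 30,000 entries.
toy_vocab = {"hug", "##ging", "##face", "made", "it", "easy"}

print(wordpiece_encode("huggingface", toy_vocab))  # ['hug', '##ging', '##face']
print(wordpiece_encode("😁", toy_vocab))            # ['[UNK]']
```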

In [90]:
cls_token_id = tokenizer.token_to_id('[CLS]')
sep_token_id = tokenizer.token_to_id('[SEP]')   
print(cls_token_id, sep_token_id)

2 3


In [91]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ]
)
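
In the template strings, `$A` and `$B` stand for the first and second sequence, and the number after each colon is the token type ID assigned to that part. The plain-Python sketch below mimics what this particular template produces; `apply_bert_template` is a hypothetical helper, not a Tokenizers function.

```python
def apply_bert_template(tokens_a, tokens_b=None):
    """Return (tokens, type_ids) laid out as [CLS] A [SEP] (B [SEP])."""
    tokens = ["[CLS]", *tokens_a, "[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += [*tokens_b, "[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    return tokens, type_ids


pair_tokens, pair_type_ids = apply_bert_template(["hello"], ["easy", "!"])
print(pair_tokens)    # ['[CLS]', 'hello', '[SEP]', 'easy', '!', '[SEP]']
print(pair_type_ids)  # [0, 0, 0, 1, 1, 1]
```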

In [92]:
encoding = tokenizer.encode_batch(["Hello, y'all! How are you 😁 ?", 'Huggingface made it easy!'])

In [93]:
encoding[0].tokens

['[CLS]',
 'hello',
 ',',
 'y',
 "'",
 'all',
 '!',
 'how',
 'are',
 'you',
 '[UNK]',
 '?',
 '[SEP]']

In [94]:
encoding[1].tokens

['[CLS]', 'hug', '##ging', '##face', 'made', 'it', 'easy', '!', '[SEP]']

In [95]:
tokenizer.decoder = decoders.WordPiece(prefix="##")
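
The decoder reverses the `##` convention: continuation pieces are glued to the previous piece, and all other pieces start a new word. A simplified sketch of that rule follows. `wordpiece_decode` is a hypothetical helper, and unlike the library decoder it makes no attempt to tidy spacing around punctuation.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Join WordPiece tokens back into text."""
    text = ""
    for token in tokens:
        if token.startswith(prefix):
            text += token[len(prefix):]  # continue the current word
        elif text:
            text += " " + token          # start a new word
        else:
            text = token                 # very first token
    return text


decoded = wordpiece_decode(["hug", "##ging", "##face", "made", "it", "easy", "!"])
print(decoded)  # huggingface made it easy !
```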

In [96]:
from transformers import BertTokenizerFast

new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

In [97]:
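# A two-item list is encoded as one sentence pair (note the two [SEP] ids in the output), not as a batch of two inputs.
encoding = new_tokenizer.encode(["Hello, y'all! How are you 😁 ?", 'Huggingface made it easy!'])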

In [98]:
print(encoding)

[2, 1236, 16, 67, 11, 761, 5, 697, 721, 605, 0, 35, 3, 20288, 6501, 4816, 2207, 602, 2120, 5, 3]


### Byte-Pair Encoding (BPE) Tokenizer

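BPE is a popular tokenization algorithm used in models like GPT-2. Training starts from individual characters (or bytes) and repeatedly merges the most frequent pair of adjacent symbols into a new token until the target vocabulary size is reached. Because any word can fall back to smaller pieces, the tokenizer can still represent rare words and misspellings.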
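
The merge loop at the heart of BPE training fits in a few lines of plain Python. The following is a minimal sketch on a made-up word-frequency table; `train_bpe` and the counts are illustrative assumptions, and a production trainer is far more optimized.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merges; return the merges and final word splits."""
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return merges, splits


# Hypothetical word counts from a C-flavoured corpus.
merges, splits = train_bpe({"malloc": 5, "calloc": 3, "alloc": 2}, num_merges=4)
print(splits)
# {'malloc': ('m', 'alloc'), 'calloc': ('c', 'alloc'), 'alloc': ('alloc',)}
```

After four merges the shared substring `alloc` has become a single symbol, which is the effect we want for frequent domain terms.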

In [99]:
tokenizer = Tokenizer(models.BPE())

In [100]:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

In [101]:
tokenizer.pre_tokenizer.pre_tokenize_str('This is a demo string copied from stackoverflow! 😊')

[('This', (0, 4)),
 ('Ġis', (4, 7)),
 ('Ġa', (7, 9)),
 ('Ġdemo', (9, 14)),
 ('Ġstring', (14, 21)),
 ('Ġcopied', (21, 28)),
 ('Ġfrom', (28, 33)),
 ('Ġstackoverflow', (33, 47)),
 ('!', (47, 48)),
 ('ĠðŁĺĬ', (48, 50))]
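
The odd-looking characters come from the byte-level step: the text is encoded as UTF-8, and each byte is shown as a printable stand-in character, so a space becomes `Ġ` and the four bytes of the emoji become `ðŁĺĬ`. The sketch below rebuilds the byte-to-character table published with GPT-2; the helper names are our own.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for byte in range(256):
        if byte not in printable:
            # Non-printable bytes are moved to code points 256 and above.
            byte_values.append(byte)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Show the UTF-8 bytes of `text` using the stand-in characters."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level(" is"))  # Ġis
print(to_byte_level(" 😊"))  # ĠðŁĺĬ
```

Since every possible byte has a stand-in, a byte-level tokenizer never needs an unknown token.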

In [103]:
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(batch_generator(), trainer=trainer)






In [104]:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
tokenizer.decoder = decoders.ByteLevel()

In [105]:
from transformers import GPT2TokenizerFast

new_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

In [106]:
encoding = new_tokenizer.encode(["Hello, y'all! How are you 😁 ?", 'Huggingface made it easy!'])

In [107]:
print(encoding)

[1664, 12, 251, 7, 353, 1, 894, 384, 264, 26360, 187, 164, 783, 40, 927, 2104, 2005, 2179, 252, 2380, 1]


### Unigram Tokenizer

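Unigram tokenization, used in models like ALBERT and T5, starts from a large set of candidate subwords and iteratively prunes the ones that contribute least to the likelihood of the corpus. At encoding time, it picks the segmentation with the highest total probability. This approach can be more flexible and often works well for languages with complex word structures.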
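
Choosing the most probable segmentation is typically done with a Viterbi-style dynamic program. The sketch below shows that idea on a toy vocabulary; the probabilities are invented for illustration, and `viterbi_segment` is a hypothetical helper rather than the library's code.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring split of `text` into vocabulary pieces,
    or None if no split exists."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability up to each position
    best_start = [0] * (n + 1)          # where the last piece of that split begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        return None
    pieces = []
    end = n
    while end > 0:  # walk the back-pointers from the end of the text
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Invented piece probabilities, for illustration only.
probs = {"s": 0.05, "t": 0.05, "r": 0.05, "o": 0.05, "k": 0.05,
         "str": 0.1, "tok": 0.08, "strtok": 0.002}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

print(viterbi_segment("strtok", log_probs))  # ['str', 'tok']
```

Here `str` + `tok` wins because 0.1 × 0.08 is larger than the probability of the single piece `strtok` and of any character-by-character split.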

In [108]:
tokenizer = Tokenizer(models.Unigram())

In [109]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace('''"''', "'")]
)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

In [110]:
tokenizer.pre_tokenizer.pre_tokenize_str("This is a demo string copied from stackoverflow! 😊")

[('▁This', (0, 4)),
 ('▁is', (4, 7)),
 ('▁a', (7, 9)),
 ('▁demo', (9, 14)),
 ('▁string', (14, 21)),
 ('▁copied', (21, 28)),
 ('▁from', (28, 33)),
 ('▁stackoverflow!', (33, 48)),
 ('▁😊', (48, 50))]
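
Metaspace replaces each space with the visible marker `▁` and keeps it attached to the following word, which is why the offsets above include the preceding space. A rough regular-expression approximation is sketched below; `metaspace_pre_tokenize` is a hypothetical helper and does not cover all options of the real pre-tokenizer.

```python
import re


def metaspace_pre_tokenize(text, replacement="▁"):
    """Return (piece, (start, end)) pairs with the space marker prepended."""
    pieces = []
    for match in re.finditer(r" ?[^ ]+", text):
        word = match.group().lstrip(" ")
        pieces.append((replacement + word, match.span()))
    return pieces


meta_pieces = metaspace_pre_tokenize("This is a demo string copied from stackoverflow! 😊")
print(meta_pieces[-2:])
# [('▁stackoverflow!', (33, 48)), ('▁😊', (48, 50))]
```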

In [112]:
trainer = trainers.UnigramTrainer(vocab_size=30000, special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"], unk_token="<unk>")
tokenizer.train_from_iterator(batch_generator(), trainer=trainer)

In [113]:
cls_token_id = tokenizer.token_to_id('[CLS]')
sep_token_id = tokenizer.token_to_id('[SEP]')
print(cls_token_id, sep_token_id)

0 1


In [114]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
tokenizer.decoder = decoders.Metaspace()

In [115]:
from transformers import AlbertTokenizerFast

new_tokenizer = AlbertTokenizerFast(tokenizer_object=tokenizer)

In [116]:
encoding = new_tokenizer.encode(["Hello, y'all! How are you 😁 ?", 'Huggingface made it easy!'])

In [117]:
print(encoding)

[0, 6081, 9, 5, 152, 42, 1255, 434, 305, 41, 19, 5, 2, 5, 61, 1, 5, 15137, 13105, 12270, 869, 17, 1001, 434, 1]


In [118]:
new_tokenizer.decode(encoding)

"[CLS] Hello, y'all! How are you <unk> ?[SEP] Huggingface made it easy![SEP]"

## Next Steps: Using Your Tokenizer

Now that you have trained and saved your custom tokenizer, you can use it to train a language model from scratch or fine-tune an existing model. Just pass your tokenizer to the training scripts or notebooks, and your model will be able to process your domain-specific data more effectively.

For more advanced usage, see the Hugging Face <a href='https://huggingface.co/docs/transformers/en/main_classes/tokenizer'>documentation</a> and examples for language modeling with custom tokenizers.