1) 'my own' - tokenizer

this repo is part of the 'my-own' series: a personal project where i develop core concepts of machine learning and deep learning from scratch.

the main goal is to understand those core algorithms i use on a daily basis.

the path i'll follow is:

  • learn the theory
  • write a summary in my own words
  • code the functionality completely from scratch in python, without the main libraries
  • code the functionality again using dedicated libraries, or even libraries built to solve the algorithm in a sophisticated and efficient way (generally the ones already used in the industry), but still from scratch

*the following text was written as quick notes, just to "prove" my knowledge to myself, probably not well written and of course super informal

tokenizer:

a tokenizer is a way to chunk the text of a given sentence and encode it into a numerical representation. this is done to feed the llm with numbers instead of raw text, aka something the llm can actually understand.

those numbers (the tokens) point to specific 'coordinates' in an n-dimensional space. each point in that space is the representation of a specific token.

an embedding table is used to encode/decode between token ids and those vectors.

in gpt2 the tokenizer vocabulary has ~50k tokens (~100k in gpt-3.5 and gpt-4). the context size is 1024 tokens (meaning the model attends to at most 1024 tokens at a time).
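a quick way to check those numbers, assuming tiktoken (openai's tokenizer library, mentioned again later) is installed:

```python
import tiktoken  # pip install tiktoken

gpt2_enc = tiktoken.get_encoding("gpt2")         # tokenizer used by gpt-2
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by gpt-3.5 / gpt-4

print(gpt2_enc.n_vocab)  # ~50k tokens
print(gpt4_enc.n_vocab)  # ~100k tokens
```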

one of the main issues with tokenizers was how they handled languages other than english. any sentence in a language with a different alphabet and/or writing system gets tokenized into a lot more tokens than its english equivalent. this bloats the text and spreads it across way too much of the sequence. this kind of situation improves with modern models like gpt3 or gpt4.

a lot of side issues are (or were), or can be (or could be), related to tokenization:

  • can't spell words
  • bad at arithmetic
  • works better with YAML than with JSON
  • can't do simple string processing

if we go from the gpt2 tokenizer to the gpt4 one, we 'win' almost 2x of effective context, because the tokenizer is incredibly better. (we have to keep in mind both the context window, i.e. the amount of tokens the model can attend to, and the amount of tokens generated by the tokenizer.) why? simply because the gpt4 tokenizer uses fewer tokens to represent the same amount of text.

we can see this in:

  • japanese and korean.
  • code (it's INCREDIBLY better with white spaces; quick check below)
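to see it yourself, a small comparison with tiktoken (assuming it's installed); the exact counts depend on the snippet, but the gpt-4 tokenizer consistently needs fewer tokens, especially for indentation:

```python
import tiktoken

code = """
def hello():
    for i in range(3):
        print("hello", i)
"""

gpt2_enc = tiktoken.get_encoding("gpt2")         # gpt-2 tokenizer
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # gpt-4 tokenizer

# the gpt-4 tokenizer groups runs of spaces, so the indented code
# ends up as noticeably fewer tokens
print("gpt2:", len(gpt2_enc.encode(code)))
print("gpt4:", len(gpt4_enc.encode(code)))
```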

as i said before, this leads to a more 'compressed' representation of sentences, which means more context fits inside the attention window.

since we cannot feed the llm raw text or raw bytes (characters represented as bytes), because the context length would be useless for attention (any sentence would be way too long), we gotta use:

byte pair encoding algorithm (BPE):

  • first of all, we assume that every unique character is initialized as a 1-character n-gram (an initial "token").
  • then, successively, the most frequent pair of adjacent tokens is merged into a new 2-character n-gram, and every occurrence of that pair in the text is replaced with this newly generated token.
  • this workflow is repeated until a vocabulary of size 'x' is created. any new word can then be constructed from the final vocabulary of tokens plus the initial set of individual characters.
  • the set of all unique tokens generated from a specific corpus is called the 'token vocabulary'.

example from the BPE wikipedia article: aaabdaaabac -> notice the pair 'aa'

ZabdZabac -> the pair 'aa' is now represented as Z (Z=aa)

ZYdZYac -> the pair 'ab' repeats, now represented as Y (Y=ab, Z=aa)

XdXac -> the pair 'ZY' repeats, now represented as X (X=ZY, Y=ab, Z=aa)
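the same example in a few lines of python, just replaying those three merges at the character level (the real thing works on bytes, sketched further down):

```python
def replace_pair(seq, pair, new_symbol):
    # replace every non-overlapping occurrence of `pair` with `new_symbol`
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seq = list("aaabdaaabac")
for pair, new_symbol in [(("a", "a"), "Z"), (("a", "b"), "Y"), (("Z", "Y"), "X")]:
    seq = replace_pair(seq, pair, new_symbol)
    print("".join(seq))
# ZabdZabac
# ZYdZYac
# XdXac
```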

of course we have to keep in mind that the tokenizer is completely isolated from the llm architecture, so it has its own training loop and its own dataset.

basically, if we decompose a simple tokenizer, we need:

  • get_pairs(tokens) -> function to count the repeated pairs in the dataset (text): it iterates over the text encoded with utf-8 (characters represented as raw bytes, 0...255). during that iteration, every adjacent pair is counted and stored, keyed by the (token_a, token_b) tuple.

  • merge(tokens, idx_start, pair) -> function to map the top pair to a new id: with the pair counts done, we pick a convenient number to start our 'new indexes' from, in this case 256. why? because our tokens are utf-8 byte values, which go from 0 to 255. so we have the tokens (the byte representation built before getting the pairs), the index to start from (256), and the pair we want to represent with a new index (the top pair from the counts). we just iterate through the tokens and, whenever a token matches pair[0] and the next one matches pair[1], the new index is appended to a new ids list (otherwise the token is copied as-is). the iteration goes over all the tokens, one by one.

*of course these functions have to be run in a loop to perform many merges over the full token sequence (rough sketch below).
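a rough sketch of those two functions plus the training loop, in plain python (names and signatures follow the description above; the actual code in the repo may differ a bit):

```python
def get_pairs(tokens):
    # count every adjacent pair of tokens, keyed by the (a, b) tuple
    counts = {}
    for a, b in zip(tokens, tokens[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge(tokens, idx_start, pair):
    # replace every occurrence of `pair` in `tokens` with the new id `idx_start`
    new_ids = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
            new_ids.append(idx_start)
            i += 2
        else:
            new_ids.append(tokens[i])
            i += 1
    return new_ids

# tiny training loop: start from raw utf-8 bytes (0..255) and mint new ids from 256
text = "my name is valentin, my name is valentin"
tokens = list(text.encode("utf-8"))
merges = {}        # (pair) -> new id, this is the 'map' used later by encode()
num_merges = 10    # in a real run this would be vocab_size - 256
for i in range(num_merges):
    pairs = get_pairs(tokens)
    top_pair = max(pairs, key=pairs.get)  # most frequent pair
    idx = 256 + i                         # new token id
    tokens = merge(tokens, idx, top_pair)
    merges[top_pair] = idx
```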

when the vocab is done:

  • encode(raw_text) -> function to transform raw text into tokens, using the map (a python dict) created by merge(): it starts by encoding the text into utf-8 bytes, exactly as we did before getting the pairs. why? because it will then get all the repeated pairs again with get_pairs(tokens), filter them down to the pairs that actually exist in merges (each pair and its new representation), take idx_start from merges[pair] (the id assigned to that pair), and apply merge() to the tokens. this repeats until no known pair is left, and the tokens are returned.

  • decode(ids) -> function to transform encoded text (aka ids, in the map we've created) back into plain text/string: 'ids' is a list of numerical representations of text (tokens, man, tokens), basically tokenized sentence(s). those tokens are mapped through 'vocab' (vocab is constructed from the 256 raw byte values plus the entries in 'merges', the dict of pair representations; basically the same vocab we encoded with). the result is a byte string, something like b"my name is valentin". that byte string is decoded with utf-8 to finally give us the text. (both functions are sketched below.)
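continuing the sketch above (same get_pairs() and merge(), same merges dict from the training loop), a minimal version of both could look like:

```python
def encode(raw_text, merges):
    # raw text -> utf-8 bytes -> keep merging while any learned pair is present
    tokens = list(raw_text.encode("utf-8"))
    while len(tokens) >= 2:
        pairs = get_pairs(tokens)
        # among the pairs present, pick the one that was learned first (lowest id)
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left that we know how to merge
        tokens = merge(tokens, merges[pair], pair)
    return tokens

def decode(ids, merges):
    # vocab: the 256 raw byte values plus one byte string per learned merge
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in merges.items():  # insertion order = training order
        vocab[idx] = vocab[a] + vocab[b]
    byte_string = b"".join(vocab[i] for i in ids)
    return byte_string.decode("utf-8", errors="replace")

ids = encode("my name is valentin", merges)
print(ids)
print(decode(ids, merges))  # "my name is valentin"
```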

usually when you train the tokenizer on a lot of data, a 'special token' is added between text docs, for example <|endoftext|>. the goal of this is to 'tell' the model that one text doc ends and a new one starts.
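for example, the gpt-2 encoding in tiktoken already reserves that token; a quick check (assuming tiktoken is installed):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("doc one<|endoftext|>doc two", allowed_special={"<|endoftext|>"})
print(ids)            # the special token shows up as a single id
print(enc.eot_token)  # 50256, the <|endoftext|> id for gpt-2
```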

the same happens when you fine-tune the model. all the '*-instruct' models, fine-tuned to handle assistant/chat behavior, have particular special tokens like <|im_start|> and <|im_end|>. this is done to give structure to the conversation and make the model understand where a particular message starts and ends.

of course you have to make some small changes in the transformer and its params to handle those special tokens. because you're adding new integers (token ids), you have to make sure each of those integers gets its own vector, specifically a new row in the embedding table. same with the final layer: the lm head has to be extended so it outputs a logit for each of the new special tokens.
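a minimal sketch of those two changes, here using pytorch just as an assumption (the notes don't tie themselves to a framework); the point is only that two weight matrices have to grow:

```python
import torch
import torch.nn as nn

old_vocab_size, d_model, num_special = 50257, 768, 2  # e.g. <|im_start|>, <|im_end|>
new_vocab_size = old_vocab_size + num_special

# token embedding table: one extra row per new special token
old_emb = nn.Embedding(old_vocab_size, d_model)
new_emb = nn.Embedding(new_vocab_size, d_model)
with torch.no_grad():
    new_emb.weight[:old_vocab_size] = old_emb.weight  # keep the trained rows
    # the new rows stay randomly initialized and get trained during fine-tuning

# lm head: one extra output logit per new special token
old_head = nn.Linear(d_model, old_vocab_size, bias=False)
new_head = nn.Linear(d_model, new_vocab_size, bias=False)
with torch.no_grad():
    new_head.weight[:old_vocab_size] = old_head.weight
```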

something else i forgot to mention is the regex split pattern used in the gpt2 and gpt4 tokenizers (i found out about this watching karpathy's video).

GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

in this way the tokenizer enforces some key conditions they want when "splitting" text. as we can see, both patterns split off "'ll", "'ve" and "'re", do the same kind of split when they find numbers, and they attach the leading space to the word that follows it (" example"), not the trailing one.

the gpt-4 pattern is a little bit more complete: it matches the apostrophe contractions case-insensitively (the (?i:...) part), which the gpt2 one doesn't, it handles big runs of whitespace better, it only groups numbers up to 3 digits at a time, and more.
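a quick way to play with the gpt-4 pattern (it needs the third-party `regex` package; the stdlib `re` doesn't understand \p{L}/\p{N}):

```python
import regex as re  # pip install regex

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

chunks = re.findall(GPT4_SPLIT_PATTERN, "Hello world, I'VE got 12345 apples!!")
print(chunks)
# ['Hello', ' world', ',', ' I', "'VE", ' got', ' ', '123', '45', ' apples', '!!']
# note: "'VE" is split off even in upper case, numbers are capped at 3 digits,
# and the leading space sticks to the word that follows it
```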

if we wanna train a real (not dummy) tokenizer, 'sentencepiece' is the most used library for that in the industry, because it can do both training and inference, which tiktoken (openai's tokenizer library) doesn't allow (it's inference only). i think the hugging face tokenizers library is really good as well, but i don't know how much it's used right now.

sentencepiece is from google and is used by llama, mistral and more llms.
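a minimal training sketch with sentencepiece (corpus.txt is a hypothetical plain-text file, one sentence per line; the real option list is much longer):

```python
import sentencepiece as spm

# train a bpe model on a raw text file; writes my_tok.model and my_tok.vocab
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # hypothetical training file
    model_prefix="my_tok",
    vocab_size=32000,
    model_type="bpe",
)

# load it back and use it for encoding/decoding (inference)
sp = spm.SentencePieceProcessor(model_file="my_tok.model")
ids = sp.encode("my name is valentin", out_type=int)
print(ids)
print(sp.decode(ids))
```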

so i was looking more into the vocab size and why they choose a particular number. all the numbers are between 50k and 100k, and the choice is mostly based on experiments. we have to be careful about rare pairs / new tokens, because some newly created tokens could be too rare (like appearing just once).

as the vocab size increases, the model's embedding table grows with it, and so does the lm head, because it has to calculate logits for more tokens. all of this means more computational power is needed.
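a rough back-of-the-envelope calculation of the two pieces that grow with the vocab (gpt2-small hidden size assumed, and weight tying between the embedding and the lm head ignored):

```python
d_model = 768  # gpt2-small hidden size (assumption for the example)

for vocab_size in (50_257, 100_000):
    emb_params = vocab_size * d_model    # embedding table: one row per token
    head_params = d_model * vocab_size   # lm head: one logit per token
    print(vocab_size, f"-> ~{(emb_params + head_params) / 1e6:.1f}M params")
# 50257  -> ~77.2M params
# 100000 -> ~153.6M params
```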

the vocab size can also be extended, as i mentioned before, when fine-tuning for example. this requires some small changes in the model as well.
