## Load the sequence

This time, we load only a sample of the text sequence rather than the entire dataset, to keep RAM usage under control. If the machine runs out of memory, BPE training can slow to a crawl or crash.

Adjust the `number_of_characters_to_read` value to find the optimal setting for your system.

In [None]:
# Read only the first N characters so the sample fits comfortably in RAM.
number_of_characters_to_read = 10_000_000
with open("../data/AtlaSetCombined.txt", "r", encoding="utf-8") as f:
    text_sequence = f.read(number_of_characters_to_read)

len(text_sequence)
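To help choose `number_of_characters_to_read`, it can be useful to see how large a sample is in bytes and what share of the file it covers. The sketch below is one way to do that; `read_sample` and `sample_report` are hypothetical helper names, not part of the notebook or of minBPE, and the report only measures the raw text, not the extra memory that training needs on top of it.

```python
import os


def read_sample(path, max_chars, encoding="utf-8"):
    """Read at most max_chars characters from the start of a text file."""
    with open(path, "r", encoding=encoding) as f:
        return f.read(max_chars)


def sample_report(path, sample, encoding="utf-8"):
    """Summarize the sample size and the share of the file it covers (in bytes)."""
    sample_bytes = len(sample.encode(encoding))
    file_bytes = os.path.getsize(path)
    return {
        "characters": len(sample),
        "sample_megabytes": sample_bytes / 1e6,
        "fraction_of_file": sample_bytes / file_bytes if file_bytes else 0.0,
    }
```

For example, `sample_report("../data/AtlaSetCombined.txt", text_sequence)` shows how much of the dataset the current setting actually uses, which makes it easier to raise or lower the value in steps.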

## BPE algorithm

I am using the [minBPE](https://github.com/karpathy/minbpe) repository to tokenize the sequence of text.

In [None]:
import sys

# Add the parent directory to the import path so that `minbpe` can be imported.
sys.path.append('..')

Start by training the tokenizer on the text sequence loaded above, which is a sample of the file you saved in the previous notebook.

In [None]:
from minbpe import RegexTokenizer

tokenizer = RegexTokenizer()
# vocab_size counts the 256 raw byte tokens plus the learned merges.
tokenizer.train(text_sequence, vocab_size=16_384)
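For intuition about what training does, here is a minimal byte-level BPE sketch. It is a simplified illustration, not minBPE's implementation: in particular, it skips the regex pre-splitting that `RegexTokenizer` applies before merging, and the function names are assumptions made for this example.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace each occurrence of pair with new_id, scanning left to right."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary holds vocab_size tokens (256 bytes plus merges)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        pairs = count_pairs(ids)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return merges, ids
```

Each iteration scans the whole sequence, which is why both the sample size and the target vocabulary size affect training time and memory.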

Inspect the learned vocabulary.

In [None]:
vocab = tokenizer.vocab
vocab

Test the tokenizer.

In [None]:
tokenizer.encode("Salam labas")

In [None]:
# These ids come from one training run; they may differ if the training sample changes.
tokenizer.decode([83, 1813, 3363, 32, 7312, 3770, 115])
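A more systematic test is a round trip: encode a string, decode the result, and compare it with the original. The sketch below assumes only that the tokenizer has `encode` and `decode` methods, as used above; `ByteTokenizer` is a hypothetical stand-in so that the example runs on its own, and `roundtrip_ok` is an illustrative helper name.

```python
class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8", errors="replace")


def roundtrip_ok(tokenizer, text):
    """Return True if decoding the encoded text gives back the original string."""
    return tokenizer.decode(tokenizer.encode(text)) == text


print(roundtrip_ok(ByteTokenizer(), "Salam labas"))
```

With the trained tokenizer, the same check would be `roundtrip_ok(tokenizer, "Salam labas")`.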

Add special tokens to the vocabulary. They are assigned the ids immediately after the largest existing id, and they are used heavily in the fine-tuning step.

In [None]:
# Use max() so the result does not depend on the dictionary's insertion order.
max_vocab_id = max(tokenizer.vocab)
tokenizer.special_tokens = {
    "<|startoftext|>": max_vocab_id + 1,
    "<|separator|>": max_vocab_id + 2,
    "<|endoftext|>": max_vocab_id + 3,
    "<|unk|>": max_vocab_id + 4,
    "<|padding|>": max_vocab_id + 5
}
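The id assignment above can be written as a small helper that also builds the reverse mapping, which is handy when decoding ids back to special-token strings. This is a sketch under assumptions: `assign_special_token_ids` and `invert_mapping` are hypothetical names, and the toy `vocab` stands in for `tokenizer.vocab`.

```python
def assign_special_token_ids(vocab, names):
    """Map each special-token string to a fresh id after the largest existing id."""
    next_id = max(vocab) + 1
    return {name: next_id + offset for offset, name in enumerate(names)}


def invert_mapping(mapping):
    """Build the id -> string lookup from a string -> id mapping."""
    return {token_id: name for name, token_id in mapping.items()}


# Toy vocabulary standing in for tokenizer.vocab: the 256 raw byte tokens.
vocab = {i: bytes([i]) for i in range(256)}
special_tokens = assign_special_token_ids(
    vocab, ["<|startoftext|>", "<|separator|>", "<|endoftext|>"]
)
inverse_special_tokens = invert_mapping(special_tokens)
```

Deriving the ids from `max(vocab)` keeps the special tokens from colliding with learned merges, whatever vocabulary size was used for training.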

Save the tokenizer.

In [None]:
tokenizer.save(file_prefix="../output/tokenizer/darija_tokenizer")
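Saving will likely fail if the `../output/tokenizer` directory does not exist yet. A minimal sketch for creating it first is shown below; `prepare_output_prefix` is a hypothetical helper, not part of minBPE.

```python
from pathlib import Path


def prepare_output_prefix(file_prefix):
    """Create the parent directory of file_prefix if needed and return the prefix as a string."""
    prefix = Path(file_prefix)
    prefix.parent.mkdir(parents=True, exist_ok=True)
    return str(prefix)
```

It can wrap the path passed to the save call, for example `tokenizer.save(file_prefix=prepare_output_prefix("../output/tokenizer/darija_tokenizer"))`.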