The Large Language Model Tokenizer Algorithm
This project provides a command-line tool implemented in C++ for training a Byte Pair Encoding (BPE) model on a text corpus and encoding text using the learned model.
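Byte Pair Encoding originated as a data compression technique and is now commonly used in Natural Language Processing (NLP) for subword tokenisation. It keeps the vocabulary at a fixed, manageable size, and because rare or unseen words can always be split into smaller known tokens, it can still represent words that never appeared in the training corpus.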
Here's a simplified overview of the BPE training process:
- Initialisation:
  - Start with a vocabulary consisting of all individual characters (or bytes) present in the training corpus.
  - Represent the corpus as a sequence of these initial character/byte tokens.
- Iteration:
  - Count the frequency of all adjacent pairs of tokens in the current sequence.
  - Identify the most frequent pair (e.g., 't' followed by 'h').
  - Merge this most frequent pair into a single new token (e.g., 'th').
  - Add this new token to the vocabulary.
  - Replace all occurrences of the original pair in the sequence with the new merged token.
- Repeat:
  - Repeat the iteration step (counting, finding the most frequent pair, merging) for a predetermined number of merges or until the desired vocabulary size is reached.
The result of training is:
- A vocabulary containing the initial characters/bytes and the new merged tokens.
- An ordered list of merge rules indicating which pairs were merged to create which new tokens.
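The training loop described above can be sketched in a few dozen lines of C++. This is a minimal illustration, not the project's actual source: the names `TokenId`, `Pair`, `MergeRule`, `to_bytes`, `merge_pair` and `train_bpe` are hypothetical, the initial vocabulary is assumed to be the 256 byte values, and ties between equally frequent pairs are broken by picking the smallest pair (an assumption; a real implementation may choose differently and will usually count pairs more efficiently).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not taken from the project's source.
using TokenId = std::uint32_t;
using Pair = std::pair<TokenId, TokenId>;

struct MergeRule {
    Pair pair;       // the two adjacent tokens that were merged
    TokenId merged;  // the ID assigned to the new merged token
};

// Initial tokenisation: one token per byte, with IDs 0..255.
std::vector<TokenId> to_bytes(const std::string& text) {
    std::vector<TokenId> ids;
    ids.reserve(text.size());
    for (unsigned char c : text) ids.push_back(c);
    return ids;
}

// Replace every non-overlapping occurrence of `pair` (left to right) with `merged`.
std::vector<TokenId> merge_pair(const std::vector<TokenId>& ids, Pair pair, TokenId merged) {
    std::vector<TokenId> out;
    out.reserve(ids.size());
    for (std::size_t i = 0; i < ids.size();) {
        if (i + 1 < ids.size() && ids[i] == pair.first && ids[i + 1] == pair.second) {
            out.push_back(merged);
            i += 2;
        } else {
            out.push_back(ids[i]);
            i += 1;
        }
    }
    return out;
}

// Learn at most (vocab_size - 256) merge rules, returned in the order they were learned.
std::vector<MergeRule> train_bpe(const std::string& text, std::size_t vocab_size) {
    assert(vocab_size > 256);  // 256 byte tokens plus at least one merge
    std::vector<TokenId> ids = to_bytes(text);
    std::vector<MergeRule> merges;
    for (TokenId next_id = 256; next_id < vocab_size; ++next_id) {
        // Count all adjacent pairs in the current sequence.
        std::map<Pair, std::size_t> counts;
        for (std::size_t i = 0; i + 1 < ids.size(); ++i) {
            ++counts[Pair(ids[i], ids[i + 1])];
        }
        if (counts.empty()) break;  // fewer than two tokens left: nothing to merge

        // Pick the most frequent pair; on ties the smallest pair wins (assumption).
        Pair best = counts.begin()->first;
        std::size_t best_count = 0;
        for (const auto& [pair, count] : counts) {
            if (count > best_count) {
                best = pair;
                best_count = count;
            }
        }

        merges.push_back(MergeRule{best, next_id});
        ids = merge_pair(ids, best, next_id);
    }
    return merges;
}
```

For example, `train_bpe("aaabdaaabac", 259)` learns three rules under these assumptions: (97, 97) becomes 256, (97, 98) becomes 257, and (256, 257) becomes 258. The vocabulary is implied by the rules: IDs 0 to 255 are the raw bytes, and each rule adds one new ID.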
Encoding new text involves:
- Splitting the text into its initial character/byte sequence.
- Applying the learned merge rules in the same order they were learned during training to the sequence until no more merges can be applied.
- The final sequence of tokens (original characters/bytes and merged tokens) is the BPE-encoded representation.
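The encoding steps above can be sketched as follows. Again this is only an illustration under stated assumptions: `TokenId`, `MergeRule` and `encode_bpe` are hypothetical names, the initial tokens are assumed to be raw bytes with IDs 0 to 255, and the merge rules are assumed to be already loaded into memory (the on-disk format of the merges file is not covered here).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not taken from the project's source.
using TokenId = std::uint32_t;

struct MergeRule {
    std::pair<TokenId, TokenId> pair;  // adjacent tokens to merge
    TokenId merged;                    // ID of the resulting token
};

// Encode `text` by applying `merges` in the order they were learned.
std::vector<TokenId> encode_bpe(const std::string& text, const std::vector<MergeRule>& merges) {
    // Step 1: split the text into its initial byte sequence.
    std::vector<TokenId> ids;
    ids.reserve(text.size());
    for (unsigned char c : text) ids.push_back(c);

    // Step 2: apply each rule once, in training order. A single left-to-right
    // pass per rule is enough here, because a later merge only creates pairs
    // that contain its own new token, so it cannot re-enable an earlier rule.
    for (const MergeRule& rule : merges) {
        assert(rule.merged >= 256);  // merged IDs must not collide with byte IDs
        std::vector<TokenId> out;
        out.reserve(ids.size());
        for (std::size_t i = 0; i < ids.size();) {
            if (i + 1 < ids.size() && ids[i] == rule.pair.first && ids[i + 1] == rule.pair.second) {
                out.push_back(rule.merged);
                i += 2;
            } else {
                out.push_back(ids[i]);
                i += 1;
            }
        }
        ids = std::move(out);
    }

    // Step 3: what remains is the BPE-encoded sequence.
    return ids;
}
```

With the rules (97, 97) to 256, (97, 98) to 257 and (256, 257) to 258, encoding "aaabdaaabac" yields the five tokens 258, 100, 258, 97, 99, and encoding "abc" yields 257, 99.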
This project uses CMake to generate build files for various build systems like Make and Ninja.
Prerequisites:
- A C++17 compliant compiler (like g++, Clang, or MSVC)
- CMake (version 3.10 or higher)
- A build tool (like make or ninja)
Steps:
- Clone/Download: Get the project files.
- Create Build Directory:
      cd byte-pair-encoding
      mkdir build
      cd build
- Configure with CMake:
  - For Makefiles (Default on many Linux/macOS systems):
        cmake ..
  - For Ninja (Often faster):
        cmake -G Ninja ..
  - For Visual Studio (Windows): Open the folder in Visual Studio, or use CMake GUI, or run from a Developer Command Prompt:
        # Example for VS 2019
        cmake -G "Visual Studio 16 2019" -A x64 ..
- Build:
  - If using Make:
        make
  - If using Ninja:
        ninja
  - If using Visual Studio: Build the solution (bpe_tool.sln) within the IDE or use MSBuild:
        msbuild bpe_tool.sln /property:Configuration=Release
The executable bpe_tool (or bpe_tool.exe on Windows) will be created in the build directory.
Place your input text file (e.g., shakespeare.txt) in the main project directory. Run the tool from the build directory or copy the executable elsewhere.
1. Train a BPE Model:
    # Usage: ./bpe_tool train <input_file> <vocab_size> <output_merges_file>
    ./bpe_tool train ../shakespeare.txt 1000 ../shakespeare.merges
- <input_file>: Path to the training text (e.g., ../shakespeare.txt).
- <vocab_size>: The target total vocabulary size (initial 256 bytes + number of merges). Must be > 256. Example: 1000.
- <output_merges_file>: Path where the learned merge rules will be saved (e.g., ../shakespeare.merges).
2. Encode Text using a Trained Model:
    # Usage: ./bpe_tool encode <input_file> <merges_file>
    ./bpe_tool encode ../my_text_to_encode.txt ../shakespeare.merges
- <input_file>: Path to the text you want to encode (e.g., ../my_text_to_encode.txt). Create this file with some sample text.
- <merges_file>: Path to the merge rules file created during training (e.g., ../shakespeare.merges).
The tool will print the resulting sequence of token IDs to the console.