# Train a new tokenizer based on smart contract opcodes

LLMs and other language models cannot consume raw words or characters directly. They need a preprocessing step called tokenization, which splits the input text into tokens and maps each token to an integer ID that is suitable for numerical processing.
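
As a rough illustration of what tokenization produces, the sketch below maps a space-separated opcode string to integer IDs using a tiny hand-built vocabulary. The `toy_vocab` dictionary and the `encode` helper are hypothetical names made up for this example. Real tokenizers such as GPT2's learn a subword vocabulary from data instead of relying on a fixed word list.

```python
# Toy illustration only: a hand-built vocabulary, not a trained tokenizer.
toy_vocab = {"<unk>": 0, "PUSH1": 1, "MSTORE": 2, "CALLDATASIZE": 3, "LT": 4}

def encode(text, vocab):
    """Split on whitespace and map each token to its integer ID."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]

ids = encode("PUSH1 PUSH1 MSTORE PUSH1 CALLDATASIZE LT PUSH2", toy_vocab)
# PUSH2 is missing from the toy vocabulary, so it falls back to <unk>
print(ids)
```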

Many pretrained models, such as GPT2, the one we use here, come with their own tokenizer that was already trained on natural language text (GPT2's was built from English web text). However, since there is no tokenizer available for Ethereum smart contract bytecode, we are going to train our own! Cool :)

In [None]:
# First we import the necessary libraries
# We are going to use Hugging Face as our AI framework
# It really makes our life easier when dealing with LLMs
from datasets import load_dataset
from transformers import AutoTokenizer
from collections import defaultdict

# Then we load the data we collected in our previous tutorial
dataset = load_dataset("text", data_files={"train": "/data/forta/ethereum/text/pretraining/pretraining_train.csv",
                                           "val": "/data/forta/ethereum/text/pretraining/pretraining_val.csv"})

In [None]:
# Find the longest training sample, measured in space-separated opcodes.
# A dedicated name avoids shadowing Python's built-in max().
max_words = max((len(row.split(" ")) for row in dataset["train"]["text"]), default=-1)
print("Max number of words =", max_words)

In [None]:
def get_training_corpus():
    batch_size = 400
    aux_dataset = dataset["train"]
    for start_idx in range(0, len(aux_dataset), batch_size):
        samples = aux_dataset[start_idx : start_idx + batch_size]
        yield samples["text"]

training_corpus = get_training_corpus()
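
To see what this generator yields, the sketch below applies the same batching pattern to a plain Python list standing in for `dataset["train"]`. The `toy_rows` list, the `batches` helper, and the batch size of 2 are made up for illustration. It also shows that a generator can be consumed only once, so `get_training_corpus()` has to be called again if the corpus is needed a second time.

```python
# Hypothetical stand-in for the "text" column of the training split.
toy_rows = ["PUSH1 PUSH1 MSTORE", "CALLDATASIZE LT", "PUSH2 JUMPI", "STOP", "PUSH1 DUP1"]

def batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    for start_idx in range(0, len(rows), batch_size):
        yield rows[start_idx : start_idx + batch_size]

toy_corpus = batches(toy_rows, batch_size=2)
first_pass = list(toy_corpus)   # consumes the generator
second_pass = list(toy_corpus)  # nothing left: the generator is exhausted

print([len(batch) for batch in first_pass])
print(second_pass)
```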

In [None]:
# Just to visualize the process, we compare the pretrained GPT2 tokenizer with the one we are going to train
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(old_tokenizer.tokenize("PUSH1 PUSH1 MSTORE PUSH1 CALLDATASIZE LT PUSH2"))

# GPT2 tokenizer training from scratch

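Here we set a vocabulary size that we settled on after repeating the training process several times. We then take the GPT2 tokenizer and retrain it on the newly collected bytecode text corpus: the tokenization pipeline stays the same, but the vocabulary is learned from scratch on our data.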
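
GPT2's tokenizer is based on byte-pair encoding (BPE), which builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols. The vocabulary size therefore limits how many merges can be learned on top of the base symbols. The sketch below is a simplified, character-level toy version of that idea, not the Hugging Face implementation (which works on bytes). The `learn_bpe_merges` helper and the `toy_words` sample are hypothetical.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy character-level BPE)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

toy_words = "PUSH1 PUSH1 MSTORE PUSH1 CALLDATASIZE LT PUSH2".split()
merges, segmented = learn_bpe_merges(toy_words, num_merges=4)
print(merges)           # the learned merge rules, in order
print(list(segmented))  # how each distinct word is split after the merges
```

With only four merges, the frequent opcode `PUSH1` already collapses into a single symbol, while rarer opcodes remain split into smaller pieces. A larger vocabulary allows more merges, so more opcodes end up as single tokens.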

In [None]:
vocab_size = 524
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, vocab_size)
# Print the same example with the newly trained tokenizer to compare it with the pretrained one
print(tokenizer.tokenize("PUSH1 PUSH1 MSTORE PUSH1 CALLDATASIZE LT PUSH2"))

In [None]:
# Finally store the resulting tokenizer so we can use it later in our LLM pretraining process.
tokenizer.save_pretrained("/data/forta/ethereum/tokenizer")

In the next notebook tutorial, [Finetuning data collection](notebook_3_GPT_finetuning_data_collection.ipynb), we describe how to collect the data that will be used later in our example finetuning tutorial.