reference: https://colab.research.google.com/drive/1876dq54hRsWGWdsrC6E4cYdxmacb5a4S?usp=sharing#scrollTo=KDKaWJI2aAxZ

**Summary** <br>
* We use the tokenizers library to create a tokenizer and train it on a sample text. <br>
* The trained tokenizer is saved to a file (trained_tokenizer.json) and then loaded back.<br>
* The sample text is tokenized using the loaded tokenizer.<br>
* A DataFrame is created from the tokens.<br>
* We use WandB to log the vocabulary DataFrame as an artifact.<br>

"As we will see in the next sections, a tokenizer cannot be trained on raw text alone. Instead, we first need to split the texts into small entities, like words. That's where the pre-tokenization step comes in. As we saw in Chapter 2, a word-based tokenizer can simply split a raw text into words on whitespace and punctuation." HF course. <br>

In our case this step is simple: the dataset is already a series of whitespace-separated tokens, so pre-tokenization only needs to split the text into "words". A `WhitespaceSplit` pre-tokenizer works fine here, because it splits on whitespace only and keeps tokens such as `NOTE_ON=70` in one piece. The model we will use is, again, "WordLevel".
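To make the choice of pre-tokenizer concrete, the sketch below emulates two splitting rules in plain Python instead of calling the library. It assumes that the `Whitespace` pre-tokenizer behaves like the regular expression `\w+|[^\w\s]+` and that `WhitespaceSplit` behaves like a split on whitespace only; both helper names are made up for illustration.

```python
import re


def whitespace_like(text):
    # Hypothetical helper. Assumed rule for `Whitespace`:
    # runs of word characters, or runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)


def whitespace_split_like(text):
    # Hypothetical helper. Assumed rule for `WhitespaceSplit`:
    # split on whitespace and nothing else.
    return text.split()


line = "BAR_START TIME_DELTA=24 NOTE_ON=70"
print(whitespace_like(line))        # "=" becomes a separate piece
print(whitespace_split_like(line))  # each event stays in one piece
```

Under these assumptions, only the whitespace-only rule keeps an event such as `NOTE_ON=70` as a single token, which is what a WordLevel vocabulary of musical events needs.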

transfo-xl — TransfoXLConfig (Transformer-XL model)

In [6]:
#!pip install wandb


In [9]:
from tokenizers import Tokenizer, trainers, models, pre_tokenizers
from tokenizers.pre_tokenizers import WhitespaceSplit
import pandas as pd
import wandb

In [25]:
# File path
file_path = "C:/Users/naomi/Thesis/Thesis/Thesis-main/tokenized_output_v2/all_tokenized_outputs.txt"

# Read the content of the file
with open(file_path, 'r') as file:
    all_tokenized_outputs = file.readlines()

# Calculate the index for the 10% split
split_index = int(0.1 * len(all_tokenized_outputs))

# Extract the 11th sample (index 10)
sample_10 = all_tokenized_outputs[10]

# Take the first 242 characters of the sample
sample = sample_10[:242]

# Output the content of the sample variable
print(sample)

# Create a WordLevel model and declare "[UNK]" as its unknown token.
# Note: `vocab` must be a dict mapping token -> id (or a filename), not a list;
# passing vocab=["[UNK]"] raises
# "TypeError: argument 'vocab': failed to extract enum PyVocab ('Vocab | Filename')".
new_tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))

# Add the pre-tokenizer (split on whitespace only)
new_tokenizer.pre_tokenizer = WhitespaceSplit()


# Yield batches of 1,000 texts
def get_training_corpus():
    dataset = all_tokenized_outputs  # Use all_tokenized_outputs directly
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]


# Trainer: the special tokens are added to the vocabulary before any trained token
trainer = trainers.WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])


PIECE_START TRACK_START INST=0 BAR_START TIME_DELTA=24 NOTE_ON=70 NOTE_ON=59 NOTE_ON=63 NOTE_ON=47 TIME_DELTA=3 NOTE_OFF=70 TIME_DELTA=1 NOTE_ON=78 TIME_DELTA=3 NOTE_OFF=78 TIME_DELTA=2 NOTE_ON=66 NOTE_ON=80 NOTE_ON=54 NOTE_OFF=59 TIME_DELTA=

In [21]:
# Train the tokenizer
new_tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
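As a rough picture of what word-level training produces, the sketch below builds a token-to-id mapping in plain Python. It is not the library's implementation: the helper name, the toy corpus, and the ordering rule (special tokens first, then most frequent tokens, ties broken alphabetically) are assumptions made for illustration.

```python
from collections import Counter


def build_word_level_vocab(lines, special_tokens):
    """Hypothetical helper: map each whitespace-separated token to an integer id."""
    counts = Counter(tok for line in lines for tok in line.split())
    # Special tokens take the lowest ids.
    vocab = {tok: idx for idx, tok in enumerate(special_tokens)}
    # Assumed ordering: most frequent first, ties broken alphabetically.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


# Toy corpus in the same event format as the dataset (made-up lines).
corpus = [
    "BAR_START NOTE_ON=70 TIME_DELTA=3 NOTE_OFF=70\n",
    "BAR_START NOTE_ON=70 TIME_DELTA=1\n",
]
vocab = build_word_level_vocab(corpus, ["[UNK]", "[PAD]"])
print(vocab)
```

Every distinct event string gets exactly one id, so the vocabulary size is the number of special tokens plus the number of distinct events seen in the corpus.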

In [22]:
# Save the trained tokenizer
new_tokenizer.save("trained_tokenizer.json")

In [23]:
# Load the trained tokenizer
loaded_tokenizer = Tokenizer.from_file("trained_tokenizer.json")
# The pre-tokenizer is stored in the JSON file, so setting it again is optional
loaded_tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

In [24]:
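# Tokenize the sample text using the loaded tokenizer.
# The sample is cut at 242 characters, so its last piece ("TIME_DELTA=") is an
# incomplete token that is probably not in the vocabulary. If the model was
# created without unk_token="[UNK]", encode() then fails with
# "Exception: WordLevel error: Missing [UNK] token from the vocabulary".
encoded = loaded_tokenizer.encode(sample)
tokens = encoded.tokens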
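The role of the unknown token can be illustrated with a small lookup function. This is a plain-Python sketch, not the library's code: the helper name, the toy vocabulary, and the use of `KeyError` are assumptions chosen for illustration.

```python
def encode_word_level(text, vocab, unk_token=None):
    """Hypothetical helper: look up each whitespace-separated token in `vocab`."""
    ids = []
    for tok in text.split():
        if tok in vocab:
            ids.append(vocab[tok])
        elif unk_token is not None and unk_token in vocab:
            # Fall back to the unknown token for out-of-vocabulary pieces.
            ids.append(vocab[unk_token])
        else:
            raise KeyError(f"{tok!r} is not in the vocabulary and no unknown token is set")
    return ids


# Toy vocabulary (made up); the last piece of the text is cut mid-token.
vocab = {"[UNK]": 0, "BAR_START": 1, "TIME_DELTA=3": 2}
truncated = "BAR_START TIME_DELTA=3 TIME_DELTA="
print(encode_word_level(truncated, vocab, unk_token="[UNK]"))
```

Without a fallback, the truncated piece `TIME_DELTA=` has nowhere to go and the lookup fails, which mirrors why a sample cut at a fixed character count needs an unknown token.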

In [None]:
# Create a DataFrame from the sample's tokens ("Index" is the position in the sequence)
vocab_df = pd.DataFrame(
    [{"Token": token, "Index": idx} for idx, token in enumerate(tokens)]
)

In [None]:
# Initialize W&B run
wandb.init()

In [None]:
# Create a W&B table from the DataFrame
vocab_table = wandb.Table(dataframe=vocab_df)

In [None]:
# Create an artifact for the processed data
processed_data_at = wandb.Artifact(name="processed_data", type="processed_data")
processed_data_at.add(vocab_table, name="vocab_table")

In [None]:
# Log the artifact
wandb.log_artifact(processed_data_at)