<div style="text-align: center;">
    <img src="static/llm_mental_model.png" alt="Diagram illustrating the mental model of a large language model with interconnected nodes representing neural network layers processing text input. The workspace is clean and digital, conveying a neutral and focused atmosphere." width="1000"/>
</div>

In [1]:
import json

with open('data/fadat_noticias.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

contents = [item['content'] for item in data if 'content' in item]

with open('data/fadat_noticias.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(contents))

## Steps necessary for preparing the embeddings:

1. Splitting text into words;
2. Converting words into token IDs;
3. Turning tokens into embedding vectors.
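
As a rough orientation, the three steps above can be sketched end to end on a toy sentence. This is a minimal illustration using only the standard library: the sample sentence, `embedding_dim`, and the randomly filled `embedding_table` are assumptions made for the example, and a real model would learn the embedding values during training.

```python
import random
import re

text = "a casa, a rua."

# 1. Split the text into word and punctuation tokens
tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]

# 2. Map each unique token to an integer ID
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]

# 3. Look up one vector per token ID in an embedding table
#    (random placeholder values; assumed here only for illustration)
random.seed(0)
embedding_dim = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]
vectors = [embedding_table[i] for i in ids]

print(tokens)                        # ['a', 'casa', ',', 'a', 'rua', '.']
print(ids)                           # [2, 3, 0, 2, 4, 1]
print(len(vectors), len(vectors[0])) # 6 4
```

Both occurrences of `a` share the same ID and therefore the same vector, which is the point of going through a vocabulary first.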

### 1. Splitting text into words

In [2]:
with open("data/fadat_noticias.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:90])

Total number of characters: 239794
A FADAT tem como formar indivíduos capazes de buscar conhecimentos e de saber utilizá-los.


In [3]:
import re
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['A', 'FADAT', 'tem', 'como', 'formar', 'indivíduos', 'capazes', 'de', 'buscar', 'conhecimentos', 'e', 'de', 'saber', 'utilizá-los', '.', 'Desta', 'forma', ',', 'a', 'Mostra', 'Científica', 'da', 'FADAT', 'pretende', 'se', 'tornar', 'tornas', 'um', 'excelente', 'instrumento']
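
The parentheses in the split pattern matter: with a capture group, `re.split` keeps the delimiters in its result, so punctuation survives as separate tokens. The short sketch below compares the pattern from the cell above with a variant without the group. The sample sentence and the variable names are made up for the illustration.

```python
import re

sample = "Olá, mundo. Isto é um teste--certo?"

# With a capture group: delimiters are returned as separate items
with_group = re.split(r'([,.:;?_!"()\']|--|\s)', sample)
tokens = [piece for piece in with_group if piece.strip()]

# Without a capture group: delimiters are discarded
without_group = re.split(r'[,.:;?_!"()\']|--|\s', sample)
words_only = [piece for piece in without_group if piece.strip()]

print(tokens)
# ['Olá', ',', 'mundo', '.', 'Isto', 'é', 'um', 'teste', '--', 'certo', '?']
print(words_only)
# ['Olá', 'mundo', 'Isto', 'é', 'um', 'teste', 'certo']
```

The `if piece.strip()` filter drops the empty strings and whitespace-only pieces that the split leaves behind.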


### 2. Converting words into token IDs

In [4]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

7721


In [5]:
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 10:
        break


('!', 0)
('#AltaPerformance', 1)
('#AplaudaUmProfessor', 2)
('#CompromissoComOsAlunos', 3)
('#Faculdade', 4)
('#FadatSempreComVocê', 5)
('#SempreComVocê', 6)
('#TimeVencedor', 7)
('#UnidosPelaVitória', 8)
('#VEMPRAJAOFADAT', 9)
('#fadatsemprecomvocê', 10)
('%', 11)


In [6]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab  # token string -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # token ID -> token string

    def encode(self, text):
        """Split text into tokens and map each token to its ID."""
        # Use the same pattern that built the vocabulary (including ':' and ';'),
        # otherwise a token such as 'word:' would never be found in the vocab
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        """Convert token IDs back into text."""
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that the join inserted before punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

### 3. Turning tokens into embedding vectors

In [7]:
tokenizer = SimpleTokenizer(vocab)
text = "Os candidatos se dedicaram profundamente, produzindo redações que exaltaram com sensibilidade e precisão o impacto espiritual e social deixado por Dom Adélio Tomasin."
ids = tokenizer.encode(text)
print(ids)

[1599, 2930, 6864, 3596, 6343, 15, 6317, 6602, 6499, 4375, 3113, 6923, 3959, 6201, 5723, 4883, 4244, 3959, 7008, 3612, 6133, 845, 331, 2077, 18]


In [8]:
print(tokenizer.decode(ids))

Os candidatos se dedicaram profundamente, produzindo redações que exaltaram com sensibilidade e precisão o impacto espiritual e social deixado por Dom Adélio Tomasin.


**Encoding a word that is not in the vocabulary raises a `KeyError`:**

In [14]:
text = "The candidates [...]"
try:
    print(tokenizer.decode(tokenizer.encode(text)))
except KeyError as e:
    print(f"KeyError: {e}")

KeyError: 'candidates'
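
Before encoding, it can help to list which tokens would trigger this error. The sketch below does that with a hypothetical helper, `find_unknown_tokens`, and a tiny made-up vocabulary (`toy_vocab`); neither is part of the notebook. It assumes the same split pattern that was used to build the vocabulary.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` missing from `vocab`, in order of first appearance."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    unknown = []
    for token in tokens:
        if token not in vocab and token not in unknown:
            unknown.append(token)
    return unknown

# Toy vocabulary, assumed only for this example
toy_vocab = {",": 0, ".": 1, "Os": 2, "candidatos": 3}

missing = find_unknown_tokens("Os candidates, os candidatos.", toy_vocab)
print(missing)  # ['candidates', 'os']
```

Note that the lookup is case-sensitive: `os` is reported as unknown even though `Os` is in the toy vocabulary.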


## Adding special context tokens

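Let's extend the vocabulary and the tokenizer class above with two special tokens: `<|unk|>`, which stands in for any word missing from the vocabulary, and `<|endoftext|>`, which marks the boundary between unrelated texts.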

In [15]:
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}
print(len(vocab))

7723


In [17]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('📈', 7718)
('📑', 7719)
('🧑\u200d⚖️', 7720)
('<|endoftext|>', 7721)
('<|unk|>', 7722)


In [18]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab  # token string -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # token ID -> token string

    def encode(self, text):
        """Split text into tokens and map each token to its ID."""
        # Use the same pattern that built the vocabulary (including ':' and ';')
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace tokens missing from the vocabulary with <|unk|>
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        """Convert token IDs back into text."""
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that the join inserted before punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

        return text

In [19]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [20]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[7722, 15, 3918, 7722, 7722, 7722, 279, 7721, 7722, 7722, 7722, 7722, 7722, 7722, 7722, 18]


In [21]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do <|unk|> <|unk|> <|unk|>? <|endoftext|> <|unk|> <|unk|> <|unk|> <|unk|> <|unk|> <|unk|> <|unk|>.


Other special tokens that researchers have adopted:

- [BOS] (beginning of a sequence)
- [EOS] (end of sequence)
- [PAD] (padding)
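
To illustrate how such tokens are typically used, the sketch below wraps each sequence in beginning and end markers and pads the shorter ones so the batch becomes rectangular. The helper `pad_batch` and the three ID values are hypothetical, chosen only for the example; they do not come from the vocabulary built above.

```python
def pad_batch(sequences, bos_id, eos_id, pad_id):
    """Wrap each sequence with BOS/EOS and right-pad all to the same length."""
    wrapped = [[bos_id] + seq + [eos_id] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]

# Hypothetical IDs for [BOS], [EOS] and [PAD]
BOS_ID, EOS_ID, PAD_ID = 100, 101, 102

batch = pad_batch([[5, 6, 7], [8]], BOS_ID, EOS_ID, PAD_ID)
print(batch)  # [[100, 5, 6, 7, 101], [100, 8, 101, 102, 102]]
```

Padding goes after the end marker here; that is one common convention, not the only one.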

## Byte Pair Encoding (BPE)

BPE breaks words into subword units, so any input can be encoded and an `<|unk|>` token is no longer needed.

BPE tokenizers were used for LLMs such as GPT-2 and GPT-3.

In [26]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.11.0


In [27]:
tokenizer = tiktoken.get_encoding("gpt2")

In [30]:
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [31]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


In [32]:
tokenizer.n_vocab

50257

Notes:

- The `<|endoftext|>` token is assigned the largest token ID (50256), the last entry in the 50,257-token vocabulary.
- The BPE tokenizer encodes and decodes unknown words (such as _someunknownPlace_) correctly.

This works because BPE breaks words that aren't in its predefined vocabulary into smaller subword units, or even individual characters, which lets it handle out-of-vocabulary words.

In [40]:
for token in tokenizer.encode("Akwirw ier<Asd.a>"):
    print(token, '->' ,tokenizer.decode([token]))

33901 -> Ak
86 -> w
343 -> ir
86 -> w
220 ->  
959 -> ier
27 -> <
1722 -> As
67 -> d
13 -> .
64 -> a
29 -> >
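
To see why subword merging never needs an unknown token, here is a toy character-level sketch of the BPE idea: repeatedly merge the most frequent adjacent pair of symbols, then encode new words by replaying the learned merges. It is a simplified illustration, not the byte-level implementation that `tiktoken` uses for GPT-2; the functions `apply_merge`, `learn_merges`, and `encode_word` and the tiny corpus are all assumptions made for the example.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = Counter(tuple(word) for word in words)  # symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(apply_merge(symbols, best))] += freq
        corpus = new_corpus
    return merges

def encode_word(word, merges):
    """Start from single characters and replay the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_merges(words, num_merges=4)

print(encode_word("lowest", merges))  # ['low', 'est']
print(encode_word("xyz", merges))     # ['x', 'y', 'z']
```

The word `lowest` never appears in the toy corpus, yet it is encoded from learned pieces, and `xyz` falls back to single characters. In both cases the pieces concatenate back to the original word, so no information is lost.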
