Reference: https://github.com/rasbt/LLMs-from-scratch/blob/d85ba93799fcac69607ecf9fa554d22d0821fe20/ch02/05_bpe-from-scratch/bpe-from-scratch-simple.ipynb

- BPE is used in models such as GPT-2 through GPT-4, Llama 3, etc.
# 1. The main idea behind byte pair encoding (BPE)

- The main idea in BPE is to convert text into an integer representation (token IDs) for LLM training

## 1.1 Bits and bytes

- Before getting into the BPE algorithm, let's introduce the notion of bytes
- Consider converting the text into a byte array (BPE stands for "byte" pair encoding, after all)



In [1]:
text = "This is some text"

byte_arry = bytearray(text, "utf-8")

print(byte_arry)

bytearray(b'This is some text')


- When we call `list()` on a `bytearray` object, each byte is treated as an individual element, and the result is a list of integers corresponding to the byte values:

In [2]:
ids = list(byte_arry)

print(ids)
len(ids)

[84, 104, 105, 115, 32, 105, 115, 32, 115, 111, 109, 101, 32, 116, 101, 120, 116]


17

- This would be a valid way to convert text into the token IDs that we need for the embedding layer of an LLM
- However, the downside of this approach is that it creates one token ID for each character (that's a lot of IDs for a short text!)
- I.e., for a 17-character input text, we have to use 17 token IDs as input to the LLM (14 letters and 3 spaces)

In [3]:
print("Number of characters:", len(text))
print("Number of token IDs:", len(ids))

Number of characters: 17
Number of token IDs: 17
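
The one-to-one match between characters and IDs above holds because the text is plain ASCII. As a small illustration (the sample string is an assumed example, not from the original notebook), UTF-8 stores non-ASCII characters in several bytes, so a byte-level conversion can yield more IDs than characters:

```python
# Assumed example string: "é" (written as \u00e9) is outside ASCII
# and takes two bytes in UTF-8
sample = "caf\u00e9"
sample_ids = list(sample.encode("utf-8"))

print("Number of characters:", len(sample))
print("Number of byte IDs:", len(sample_ids))
print(sample_ids)
```

Here the four characters become five byte values, because "é" is encoded as the two bytes 195 and 169.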


- Instead of one ID per character, a BPE tokenizer has a vocabulary with token IDs for whole words or subwords
- For example, the GPT-2 tokenizer tokenizes the same text into 4 token IDs instead of 17

<br>
<img src = "/DATA/pyare/Routine/LLM/Reasoning/LLMs-from-scratch-pyare/chapter-2-tokenization/BPE_1.png">
</br>

In [4]:
import tiktoken

gpt2_tokenizer = tiktoken.get_encoding("gpt2")
gpt2_tokenizer.encode("This is some text")

[1212, 318, 617, 2420]

In [5]:
gpt2_tokenizer.encode("This is some test")

[1212, 318, 617, 1332]

- Since a byte consists of 8 bits, there are $2^8 = 256$ possible values that a single byte can represent, ranging from 0 to 255
- Passing a value outside this range to `bytearray` raises an error:


In [6]:
bytearray(range(0, 257))

ValueError: byte must be in range(0, 256)

- A BPE tokenizer usually uses these 256 values as its first 256 single-character tokens:

In [7]:
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

for i in range(300):
    decoded = gpt2_tokenizer.decode([i])

    print(f"{i}: {decoded}")

0: !
1: "
2: #
3: $
4: %
5: &
6: '
7: (
8: )
9: *
10: +
11: ,
12: -
13: .
14: /
15: 0
16: 1
17: 2
18: 3
19: 4
20: 5
21: 6
22: 7
23: 8
24: 9
25: :
26: ;
27: <
28: =
29: >
30: ?
31: @
32: A
33: B
34: C
35: D
36: E
37: F
38: G
39: H
40: I
41: J
42: K
43: L
44: M
45: N
46: O
47: P
48: Q
49: R
50: S
51: T
52: U
53: V
54: W
55: X
56: Y
57: Z
58: [
59: \
60: ]
61: ^
62: _
63: `
64: a
65: b
66: c
67: d
68: e
69: f
70: g
71: h
72: i
73: j
74: k
75: l
76: m
77: n
78: o
79: p
80: q
81: r
82: s
83: t
84: u
85: v
86: w
87: x
88: y
89: z
90: {
91: |
92: }
93: ~
94: �
95: �
96: �
97: �
98: �
99: �
100: �
101: �
102: �
103: �
104: �
105: �
106: �
107: �
108: �
109: �
110: �
111: �
112: �
113: �
114: �
115: �
116: �
117: �
118: �
119: �
120: �
121: �
122: �
123: �
124: �
125: �
126: �
127: �
128: �
129: �
130: �
131: �
132: �
133: �
134: �
135: �
136: �
137: �
138: �
139: �
140: �
141: �
142: �
143: �
144: �
145: �
146: �
147: �
148: �
149: �
150: �
151: �
152: �
153: �
154: �
155: �
156: �
157: �
158:

- Note that entries 256 and 257 (beyond the point where the listing above is cut off) are not single-character values but double-character values (a whitespace plus a letter), which is a small shortcoming of the original GPT-2 BPE tokenizer (this has been improved in the GPT-4 tokenizer)

## 1.2 Building the vocabulary

The goal of the BPE tokenization algorithm is to build a vocabulary of commonly occurring subwords like `298: ent` (which can be found in *entangle, entertain, enter, entrance, entity, ...*, for example), or even complete words like

- `318: is`
- `617: some`
- `1212: This`
- `2420: text`

 
## 1.3 BPE algorithm outline
**1. Identify frequent pairs**

- In each iteration, scan the text to find the most commonly occurring pair of bytes (or characters)

**2. Replace and record**

- Replace that pair with a new placeholder ID (one not already in use, e.g., if we start with 0...255, the first placeholder would be 256)
- Record this mapping in a lookup table
- The size of the lookup table is a hyperparameter, also called "vocabulary size" (for GPT-2, that's 50,257)

**3. Repeat until no gains**

- Keep repeating steps 1 and 2, continually merging the most frequent pairs
- Stop when no further compression is possible (e.g., no pair occurs more than once)

**Decompression (decoding)**

- To restore the original text, reverse the process by substituting each ID with its corresponding pair, using the lookup table
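
The training steps (1 to 3) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: the helper names `most_frequent_pair`, `merge_pair`, and `train_bpe_sketch` are hypothetical, the text is split into raw UTF-8 bytes, and ties between equally frequent pairs are resolved by taking the pair that occurs first.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return (pair, count) for the most common adjacent pair, or (None, 0)."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None, 0
    # Among equally frequent pairs, the first one encountered is returned
    return pairs.most_common(1)[0]


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe_sketch(text, vocab_size):
    """Learn merges until vocab_size is reached or no pair repeats."""
    ids = list(text.encode("utf-8"))
    merges = {}  # lookup table: (id1, id2) -> new_id
    next_id = 256  # IDs 0..255 are reserved for the single bytes
    while next_id < vocab_size:
        pair, count = most_frequent_pair(ids)
        if pair is None or count < 2:
            break  # no pair occurs more than once
        ids = merge_pair(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return ids, merges


ids, merges = train_bpe_sketch("the cat in the hat", vocab_size=300)
print(ids)
print(merges)
```

On this short input the loop stops well before the requested vocabulary size, since after four merges no pair occurs more than once.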

## 1.4 BPE algorithm example

### 1.4.1 Concrete example of the encoding part (step 1 & 2)

- Suppose we have the text (training dataset) `the cat in the hat`, from which we want to build the vocabulary for a BPE tokenizer

**Iteration 1**

**1. Identify frequent pairs**

- In this text, "th" appears twice (at the beginning and before the second "e")

**2. Replace and record**

- Replace "th" with a new token ID that is not already in use, e.g., 256
- The new text is: `<256>e cat in <256>e hat`
- The new vocabulary is:

```
0: ...
...
256: "th"
```

**Iteration 2**

**1. Identify frequent pairs**

- In the text `<256>e cat in <256>e hat`, the pair `<256>e` appears twice

**2. Replace and record**

- Replace `<256>e` with a new token ID that is not already in use, e.g., 257
- The new text is: `<257> cat in <257> hat`
- The updated vocabulary is:

```
0: ...
...
256: "th"
257: "<256>e"
```

**Iteration 3**

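**1. Identify frequent pairs**

- In the text `<257> cat in <257> hat`, the pair `<257> ` (token 257 followed by a space) appears twice

**2. Replace and record**

- Replace `<257> ` with a new token ID that is not already in use, e.g., 258
- The new text is: `<258>cat in <258>hat`
- The updated vocabulary is:

```
0: ...
...
256: "th"
257: "<256>e"
258: "<257> "
```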

- and so forth
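
Note that "th" is not the only pair that occurs twice in the training text. A quick count of adjacent character pairs makes this visible (a small sketch; picking the first of the tied pairs, as the example does, is one possible tie-breaking rule and is an assumption here):

```python
from collections import Counter

text = "the cat in the hat"

# Count every pair of adjacent characters
pair_counts = Counter(zip(text, text[1:]))

# Keep only the pairs that occur more than once, in order of first appearance
repeated = {"".join(pair): n for pair, n in pair_counts.items() if n > 1}
print(repeated)
```

The pairs "th", "he", "e " and "at" are all tied at two occurrences, so an implementation needs some rule for choosing among them.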

### 1.4.2 Concrete example of the decoding part (step 3)

- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair in the reverse order they were introduced

- Start with the final compressed text: `<258>cat in <258>hat`
- Substitute `<258>` --> `<257> `: `<257> cat in <257> hat`
- Substitute `<257>` --> `<256>e`: `<256>e cat in <256>e hat`
- Substitute `<256>` --> `th`: `the cat in the hat`
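
The substitution steps above can be sketched as a small function. This is an illustrative sketch, not the notebook's `decode()` method: the name `decode_ids` is hypothetical, and the merge table is assumed to be stored as `{(id1, id2): new_id}` with byte values as the base IDs.

```python
def decode_ids(ids, merges):
    """Expand merged IDs back into bytes, newest merge first, then decode."""
    # Invert the lookup table: new_id -> (id1, id2)
    pairs_by_id = {new_id: pair for pair, new_id in merges.items()}
    ids = list(ids)
    # Undo merges in the reverse order they were introduced (highest ID first)
    for new_id in sorted(pairs_by_id, reverse=True):
        expanded = []
        for token_id in ids:
            if token_id == new_id:
                expanded.extend(pairs_by_id[new_id])
            else:
                expanded.append(token_id)
        ids = expanded
    return bytes(ids).decode("utf-8")


# Merge table from the worked example:
# 256 = "th", 257 = <256> + "e", 258 = <257> + " "
merges = {(ord("t"), ord("h")): 256, (256, ord("e")): 257, (257, ord(" ")): 258}

# "<258>cat in <258>hat" written as a list of IDs
compressed = [258, *b"cat in ", 258, *b"hat"]
print(decode_ids(compressed, merges))
```

Processing the highest ID first matters because later merges are defined in terms of earlier ones (258 expands to 257, which expands to 256).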


# 2. A simple BPE implementation

- A simple Python class that mimics `tiktoken`
- Note that the encoding part above describes the original training step, implemented as `train()`; however, the `encode()` method works similarly (although it looks more complicated due to the special-token handling)

1.
2.
3.
4.


In [None]:
from collections import Counter, deque
from functools import lru_cache


class BPETokenizerSimple:

    def __init__(self):

        # Maps token_id to token_str (e.g., {11246: "some"})
        self.vocab = {}

        # Maps token_str to token_id (e.g., {"some": 11246})
        self.inverse_vocab = {}

        # Dictionary of BPE merges: {(token_id1, token_id2): new_token_id}
        self.bpe_merges = {}

    def train(self, text, vocab_size, allowed_special={"<|endoftext|>"}):

        """

        Train the BPE tokenizer from scratch.

        Args:
            text (str): The training text.
            vocab_size (int): The size of the vocabulary to build.
            allowed_special (set): A set of special tokens to include in the vocabulary.
        """

        # Preprocess: replace spaces with "Ġ"
        # Note that Ġ is a particularity of the GPT-2 BPE implementation
        # E.g., "Hello world" might be tokenized as ["Hello", "Ġworld"]
        # (the GPT-4 BPE would tokenize it as ["Hello", " world"])

        processed_text = []

        for i, char in enumerate(text):

            if char == " " and i != 0:
                processed_text.append("Ġ")

            if char != " ":
                processed_text.append(char)

        processed_text = "".join(processed_text)


        # Initialize vocab with unique characters, including "Ġ" if present
        # Start with the first 256 Unicode code points (ASCII plus Latin-1), as produced by chr()

        unique_chars = [chr(i) for i in range(256)]

        # Extend unique_chars with characters from processed_text that are not already included
        unique_chars.extend(char for char in sorted(set(processed_text)) if char not in unique_chars)

        # Optionally, ensure "Ġ" is included if it is relevant to your text processing
        if 'Ġ' not in unique_chars:
            unique_chars.append('Ġ')

        # Now create the vocab and inverse vocab dictionaries
        # continue ..
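
The cell above stops right before the vocabulary dictionaries are created. One plausible way to carry out that step is shown below as a standalone sketch; the function name `build_vocab` is hypothetical and not part of the class, and the dictionary layout simply follows the comments in `__init__` (ID to string, and string to ID).

```python
def build_vocab(processed_text):
    """Hypothetical helper: build ID -> char and char -> ID mappings."""
    # Start with the first 256 code points, as in the cell above
    unique_chars = [chr(i) for i in range(256)]
    seen = set(unique_chars)

    # Add any further characters from the text, in sorted order
    for char in sorted(set(processed_text)):
        if char not in seen:
            unique_chars.append(char)
            seen.add(char)

    # Make sure the space marker is present even if the text has no spaces
    if "Ġ" not in seen:
        unique_chars.append("Ġ")

    vocab = {i: char for i, char in enumerate(unique_chars)}
    inverse_vocab = {char: i for i, char in vocab.items()}
    return vocab, inverse_vocab


vocab, inverse_vocab = build_vocab("HelloĠworld")
print(len(vocab), inverse_vocab["Ġ"])
```

Because "Ġ" lies outside the first 256 code points, it receives the next free ID after them in this sketch.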



        
        
