# Preparing Data for an LLM
## Learning Goals

## Description
In this lab we will prepare our data so that it can be consumed by our LLM.

Prerequisites before the lab:
- Build a Large Language Model (From Scratch), chapters 1 and 2 (video or text format)

### Citation
Raschka, S. (2024). Build a large language model (from scratch). Manning Publications.

### Lab Deliverables

Read through: https://github.com/rasbt/LLM-workshop-2024/blob/main/02_data/02.ipynb


### Step 1. Load the text
First we need to import our text file.
- Import a .txt file of your choosing. Project Gutenberg provides a number of plain-text options.
  - For example, take a few chapters from Mary Shelley's Frankenstein: https://www.gutenberg.org/cache/epub/84/pg84.txt
- Print the total number of characters in your text.

<details>
  <summary>Click Here to view solution</summary>

```python
with open("frankenstein.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])
```
</details>

Sample run:

```python
with open("frankenstein_sample.txt", "r", encoding="utf-8") as f:
    txt = f.read()

print("Total characters:", len(txt))
print(txt[:100])
```

Output:

```
Total characters: 438806
The Project Gutenberg eBook of Frankenstein; Or, The Modern Prometheus
    
This ebook is for the us
```


### Step 2. Tokenize the data
The next step involves splitting our text into individual word-level tokens. We need to be careful with punctuation and spacing to ensure we correctly isolate each part of the text.
- Import `re`, Python’s regular expression module.
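- Preprocess the text using `re.split` with a regular expression.  
  A good starting pattern is:  
  `([,.:;?_!"()\']|--|\s)`  
  This splits the text on punctuation, double dashes, and whitespace. The outer parentheses form a capturing group, which makes `re.split` keep the matched punctuation and whitespace as separate items instead of discarding them.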
- Real-world text often contains additional characters or formatting, so you may need to adjust the regular expression or apply additional filtering for things like line breaks or tabs.
- Use a list comprehension (preferred) or a loop to remove empty strings or unwanted tokens that were not handled by the regular expression, such as tabs or line breaks.
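
To see what the capturing group and the filtering step do, here is a minimal sketch on a made-up sentence. The `sample` string and the variable names are illustrative only and are not part of the lab data.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "Hello, world. Is this-- a test?"

# The capturing group (...) makes re.split keep the delimiters in the result
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only pieces
tokens = [p for p in pieces if p.strip()]

print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Printing `pieces` before filtering shows the empty strings and single spaces that `re.split` leaves behind, which is why the filtering step is needed.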

<details>
  <summary>Click Here to view solution</summary>

```python
import re

# Split on punctuation, double dashes, and whitespace;
# the capturing group keeps the delimiters as separate items
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)

# Drop empty strings and whitespace-only items (spaces, tabs, line breaks)
preprocessed = [item for item in preprocessed if item.strip()]

print(preprocessed[:38])
```
</details>

Sample run:

```python
import re

words = re.split(r'([,.:;?_!"()\']|--|\s)', txt)
words = [w for w in words if w.strip()]

print(len(words))
words[:25]
```

Output:

```
88440

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Frankenstein',
 ';',
 'Or',
 ',',
 'The',
 'Modern',
 'Prometheus',
 'This',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'United',
 'States']
```

### Step 3. Filter out duplicate words
To build a vocabulary for our LLM, we want to get rid of any duplicate tokens.
- A `set` is an unordered Python collection that does not allow duplicates.
- Converting a list to a set automatically removes duplicate items.
- Use Python’s `set()` to remove duplicates from `preprocessed`, then use `sorted()` to turn the result into a sorted list.
- Print the number of unique tokens in the vocabulary.

<details>
  <summary>Click Here to view solution</summary>

```python
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)
```
</details>

Sample run:

```python
vocab = sorted(set(words))
print(len(vocab))
```

Output:

```
8196
```


### Step 4a. Build the vocabulary dictionary
While we understand words naturally, our models do not. To work with text, we need to encode words as numerical values. We will start by creating a dictionary that assigns an index value to each token, which will act as its ID or encoded representation.
- Call the built-in `enumerate()` function on `all_words`. It yields `(index, token)` pairs.  
  If we had `['cat', 'dog']`, wrapping the call in `list()` would give:  
  `[(0, 'cat'), (1, 'dog')]`
- Create a variable called `vocab` that is a dictionary built from these pairs, where:
  - the key is the token
  - the value is the integer index  
  Example: `{'cat': 0, 'dog': 1}`
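
The deduplicate, sort, and enumerate steps can be sketched on a tiny made-up token list. The names `toy_tokens`, `toy_words`, and `toy_vocab` are hypothetical and chosen so they do not clash with the lab's variables.

```python
# Hypothetical toy token list, used only for illustration
toy_tokens = ["dog", "cat", "dog", ".", "cat"]

# set() removes duplicates; sorted() returns a list in a repeatable order
toy_words = sorted(set(toy_tokens))

# enumerate() yields (index, token) pairs; the comprehension flips them into token -> ID
toy_vocab = {token: integer for integer, token in enumerate(toy_words)}

print(toy_words)  # ['.', 'cat', 'dog']
print(toy_vocab)  # {'.': 0, 'cat': 1, 'dog': 2}
```

Sorting before enumerating means the same text always produces the same IDs, which would not be guaranteed if the IDs were assigned while iterating over the set directly.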

<details>
  <summary>Click Here to view solution</summary>

~~~python
vocab = {token: integer for integer, token in enumerate(all_words)}
~~~

</details>

Sample run:

```python
w2i = {w: i for i, w in enumerate(vocab)}

for i, (w, idx) in enumerate(w2i.items()):
    print(f"{w}: {idx}")
    if i > 8:
        break
```

Output:

```
!: 0
#84]: 1
$1: 2
$5: 3
(: 4
): 5
***: 6
,: 7
-: 8
.: 9
```


### Step 4b. Tokenizer class
Next we will need to make a class that can encode and decode our text.
- Create a class called `SimpleTokenizerV1`.

- Give the class two attributes:
  - `str_to_int`: a dictionary mapping tokens to integer IDs.
  - `int_to_str`: a reverse dictionary mapping integer IDs back to tokens.
    - This is created using `{i: s for s, i in vocab.items()}`.

- Create a method called `encode` that:
  - Takes a string as input.
  - Splits the text into tokens using punctuation and whitespace.
  - Removes whitespace-only tokens (spaces, tabs, and line breaks).
  - Converts each token into its corresponding integer ID.
  - Returns a list of integers.

- Create a method called `decode` that:
  - Takes a list of integer IDs as input.
  - Converts each ID back into its token.
  - Joins tokens into a single string using spaces.
  - Fixes spacing before punctuation.
  - Returns the reconstructed text.
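
The "fix spacing before punctuation" step in `decode` can be tried on its own. This is a minimal sketch with made-up tokens; the regular expression is one possible choice that covers the same punctuation marks used for splitting.

```python
import re

# Hypothetical tokens, as they might come back from an ID-to-token lookup
tokens = ["Hello", ",", "world", "!"]

joined = " ".join(tokens)
print(joined)  # Hello , world !

# Remove the whitespace that join() placed before each punctuation mark
fixed = re.sub(r'\s+([,.;:?!"()\'])', r'\1', joined)
print(fixed)  # Hello, world!
```

This simple rule also pulls opening quotes and opening parentheses up against the preceding word, so the round trip is not exact for every input.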

<details>
  <summary>Click Here to view solution</summary>

```python
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Use the same split pattern that built the vocabulary (including : and ;)
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty and whitespace-only tokens
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
```
</details>

Sample run:

```python
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.w2i = vocab
        self.i2w = {v: k for k, v in vocab.items()}

    def encode(self, text):
        toks = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        toks = [t for t in toks if t.strip()]
        return [self.w2i[t] for t in toks]

    def decode(self, ids):
        txt = " ".join([self.i2w[i] for i in ids])
        txt = re.sub(r'\s+([,.;:?!"()\'])', r'\1', txt)
        return txt
```

### Step 5. Testing the Tokenizer Class

Let’s make sure everything is working correctly.

- Instantiate the tokenizer class using the `vocab` dictionary and store it in a variable called `tokenizer`.
- Select a line of text from your source data.
  - The text **must** come from the same source used to build the vocabulary.
  - If the text contains words or symbols that are not in the vocabulary, encoding will fail.
- Run `.encode()` on the text.
- Print the encoded token IDs to verify the output.
- Call `.decode()` on the IDs and print the result to confirm that it matches the original text.
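
The reason encoding fails on unseen words is that the token-to-ID lookup is a plain dictionary access, which raises a `KeyError` for a missing key. A minimal sketch with a made-up three-word vocabulary (the names `toy_vocab` and `encode_words` are hypothetical and not part of the lab):

```python
# Hypothetical three-token vocabulary, used only for illustration
toy_vocab = {"the": 0, "cat": 1, "sat": 2}

def encode_words(words, vocab):
    """Look up each word's ID; raises KeyError for words missing from vocab."""
    return [vocab[w] for w in words]

print(encode_words(["the", "cat", "sat"], toy_vocab))  # [0, 1, 2]

try:
    encode_words(["the", "dog", "sat"], toy_vocab)
except KeyError as err:
    print("Unknown token:", err)  # Unknown token: 'dog'
```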

<details>
  <summary>Click Here to view solution</summary>

```python
tokenizer = SimpleTokenizerV1(vocab)

text = "Before this I was not unacquainted with the more obvious laws of electricity."
ids = tokenizer.encode(text)
print(ids)

# Decode the IDs to confirm the round trip reproduces the original text
print(tokenizer.decode(ids))
```
</details>

Sample run:

```python
tok = SimpleTokenizerV1(w2i)

s = "Before this I was not unacquainted with the more obvious laws of electricity."
enc = tok.encode(s)
print(enc)
```

Output:

```
[145, 7193, 393, 7728, 5118, 7414, 7866, 7160, 4954, 5173, 4550, 5188, 2911, 9]
```

```python
tok.decode(enc)
```

Output:

```
'Before this I was not unacquainted with the more obvious laws of electricity.'
```

### Step 6. Something a Bit More Complex

We just walked through building our own simple tokenizer. Thankfully, there are existing tools that handle tokenization for us in a much more robust way.

In this step, we will use **tiktoken**, the tokenizer used by GPT-style models. Unlike our word-level tokenizer, tiktoken breaks text into **subword tokens**, allowing it to handle unknown words more gracefully.

- Create a new variable called `tokenizer` and set it using `tiktoken.get_encoding("gpt2")`.
- Create a string of text to encode (it can be any text).
- Call `tokenizer.encode()` and pass in the text.
- Print the resulting list of token IDs.

<details>
  <summary>Click Here to view solution</summary>

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = (
    "Hello, do you like tea? In the sunlit terraces "
    "of someunknownPlace."
)

integers = tokenizer.encode(text)

print(integers)
```
</details>

Sample run:

```python
import tiktoken

gpt2_tok = tiktoken.get_encoding("gpt2")

test = "Hello, do you like tea? In the sunlit terraces of someunknownPlace."

nums = gpt2_tok.encode(test)
nums
```

Output:

```
[15496,
 11,
 466,
 345,
 588,
 8887,
 30,
 554,
 262,
 4252,
 18250,
 8812,
 2114,
 286,
 617,
 34680,
 27271,
 13]
```

```python
gpt2_tok.encode(s)
```

Output:

```
[8421,
 428,
 314,
 373,
 407,
 555,
 43561,
 14215,
 351,
 262,
 517,
 3489,
 3657,
 286,
 8744,
 13]
```

### Step 7. Testing decoding
Let's try its decoding method.
- Call `tokenizer.decode()` and pass it our integers.
- Print the resulting string to confirm that it matches the original text.

<details>
  <summary>Click Here to view solution</summary>

```python
strings = tokenizer.decode(integers)

print(strings)
```
</details>

Sample run:

```python
gpt2_tok.decode(nums)
```

Output:

```
'Hello, do you like tea? In the sunlit terraces of someunknownPlace.'
```

### Step 8. Data sampling
Now that we have tokens, we need a way to load them for training. LLMs are not trained on individual words in isolation but on the sequence in which those words appear, so we need to create batches of token sequences for the model: a sliding "window" that moves over the tokens in sequential order.
- Import `create_dataloader_v1` from `supplementary`.
- Call `create_dataloader_v1` with the following arguments: `raw_text`, `batch_size=8`, `max_length=4`, `stride=4`, `shuffle=False`.
  - `raw_text` is our data from the beginning of this lab.
  - `batch_size` is how many sequences are in each batch.
  - `max_length` is the number of tokens in a sequence, or the "window" size.
  - `stride` is how far the window slides forward each time.
  - `shuffle=False` keeps our sequences in order.
- Create a variable called `data_iter` and set it to `iter(dataloader)`.
  - This allows us to move through the batches using `next()`.
- Create variables called `inputs` and `targets` and set them to `next(data_iter)`.
- Print `inputs` and `targets`. Each row of `targets` should be the matching row of `inputs` shifted forward by one token.
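
The windowing idea can be sketched in plain Python without any data loader. This is not the implementation of `create_dataloader_v1`; `sliding_windows` is a hypothetical helper and the token IDs are made up, so treat it as an illustration of how `max_length` and `stride` interact.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (inputs, targets) lists; each target is its input shifted one token right."""
    inputs, targets = [], []
    # Stop early enough that the target window still has max_length tokens
    for start in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[start:start + max_length])
        targets.append(token_ids[start + 1:start + max_length + 1])
    return inputs, targets

# Hypothetical token IDs, used only for illustration
ids = list(range(10, 20))  # [10, 11, ..., 19]

inputs, targets = sliding_windows(ids, max_length=4, stride=4)
print(inputs)   # [[10, 11, 12, 13], [14, 15, 16, 17]]
print(targets)  # [[11, 12, 13, 14], [15, 16, 17, 18]]
```

With `stride` equal to `max_length` the input windows do not overlap. A smaller stride, such as `stride=1`, would produce overlapping windows and therefore more training sequences from the same text.
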

<details>
  <summary>Click Here to view solution</summary>

```python
from supplementary import create_dataloader_v1


dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```
</details>

Sample run:

```python
from supplementary import create_dataloader_v1

dl = create_dataloader_v1(txt, batch_size=8, max_length=4, stride=4, shuffle=False)

it = iter(dl)
x, y = next(it)

print("Inputs:\n", x)
print("\nTargets:\n", y)
```

Output:

```
Inputs:
 tensor([[  464,  4935, 20336, 46566],
        [  286, 45738,    26,  1471],
        [   11,   383, 12495, 42696],
        [  198,   220,   220,   220],
        [  220,   198,  1212, 47179],
        [  318,   329,   262,   779],
        [  286,  2687,  6609,   287],
        [  262,  1578,  1829,   290]])

Targets:
 tensor([[ 4935, 20336, 46566,   286],
        [45738,    26,  1471,    11],
        [  383, 12495, 42696,   198],
        [  220,   220,   220,   220],
        [  198,  1212, 47179,   318],
        [  329,   262,   779,   286],
        [ 2687,  6609,   287,   262],
        [ 1578,  1829,   290,   198]])
```
