# Chapter 2 - Interactive

### This notebook contains code blocks, images, and GIFs to build your understanding and intuition of the topics listed below:

- #### tokenization + dataloading
- #### positional embeddings

## Tokenization:

#### Let's go through the entire process of tokenizing a piece of text and creating batches of data that you can feed into your GPT-2 model.

#### First, let's demonstrate what it means to turn the words in a text into tokens.

#### Every word or special character in the vocabulary has a corresponding number, as you can see below. When you tokenize a text, all you are doing is replacing each word or special character with its number.

<div style="max-width:800px">
    
![](images/interactive_1.gif)

</div>

#### Create a sentence using the words below, with no punctuation. Include one word that is not explicitly in the vocabulary.

In [7]:
vocabulary = {
    "hello": 1,
    "world": 2,
    "i": 3,
    "am": 4,
    "learning": 5,
    "tokenization": 6,
    "this": 7,
    "is": 8,
    "fun": 9,
    "<|UNK|>": 10,
}

In [14]:
YOUR_SENTENCE = "tokenization is so fun"

YOUR_TEXT = YOUR_SENTENCE.split(" ")
print(YOUR_TEXT)

['tokenization', 'is', 'so', 'fun']
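
#### A side note on splitting, as a small sketch that is not part of the original notebook: `split(" ")` breaks on every single space, so an accidental double space produces an empty string, and capitalized words will not match the lowercase vocabulary. The example sentence below is made up for illustration.

```python
# Made-up example: note the capital T and the double space after it.
sentence = "Tokenization  is so fun"

# Splitting on one space keeps an empty string where the double space was.
print(sentence.split(" "))       # ['Tokenization', '', 'is', 'so', 'fun']

# Lowercasing first and calling split() with no argument avoids both issues.
print(sentence.lower().split())  # ['tokenization', 'is', 'so', 'fun']
```

#### If your own sentence has extra spaces or capital letters, normalizing it this way keeps those words from being treated as unknown.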


#### Let's tokenize this text:

In [9]:
YOUR_TEXT_TOKENIZED = []

for word in YOUR_TEXT:
    if word in vocabulary:
        YOUR_TEXT_TOKENIZED.append(vocabulary[word]) # If the word is in the vocab replace it with the corresponding number
    else:
        YOUR_TEXT_TOKENIZED.append(vocabulary["<|UNK|>"]) # If the word is not in the vocabulary replace it with 10 (corresponding number for unknown word)

print(YOUR_TEXT_TOKENIZED)

[6, 8, 10, 9]
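
#### The loop above can also be packaged as a small reusable function. This is a minimal sketch, not part of the original notebook: the name `encode` is hypothetical, and the vocabulary is repeated so the block runs on its own. It relies on `dict.get`, which returns a fallback value when the key is missing.

```python
# Same toy vocabulary as above, repeated so this block is self-contained.
vocabulary = {
    "hello": 1, "world": 2, "i": 3, "am": 4, "learning": 5,
    "tokenization": 6, "this": 7, "is": 8, "fun": 9, "<|UNK|>": 10,
}

def encode(sentence, vocab):
    """Hypothetical helper: map each space-separated word to its token ID."""
    unk_id = vocab["<|UNK|>"]
    # dict.get returns unk_id whenever the word is not a key in vocab.
    return [vocab.get(word, unk_id) for word in sentence.split(" ")]

print(encode("tokenization is so fun", vocabulary))  # [6, 8, 10, 9]
```

#### The result matches the loop version: every known word becomes its number, and every other word becomes 10.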


#### Let's see how the words and numbers relate:

In [10]:
for index, token_id in enumerate(YOUR_TEXT_TOKENIZED):
    print(f"Word {index+1} is {YOUR_TEXT[index]}\nIn the vocabulary {YOUR_TEXT[index]} corresponds to the number {token_id}\nSo, in the tokenized text it is {token_id}\n")

Word 1 is tokenization
In the vocabulary tokenization corresponds to the number 6
So, in the tokenized text it is 6

Word 2 is is
In the vocabulary is corresponds to the number 8
So, in the tokenized text it is 8

Word 3 is so
In the vocabulary so corresponds to the number 10
So, in the tokenized text it is 10

Word 4 is fun
In the vocabulary fun corresponds to the number 9
So, in the tokenized text it is 9
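
#### Going the other way, from numbers back to words, shows what the unknown token costs you. The sketch below is not part of the original notebook: `id_to_word` and `decoded` are hypothetical names, and the vocabulary and token IDs are repeated so the block runs on its own.

```python
# Same toy vocabulary as above, repeated so this block is self-contained.
vocabulary = {
    "hello": 1, "world": 2, "i": 3, "am": 4, "learning": 5,
    "tokenization": 6, "this": 7, "is": 8, "fun": 9, "<|UNK|>": 10,
}

# Invert the mapping so we can look up a word by its token ID.
id_to_word = {token_id: word for word, token_id in vocabulary.items()}

token_ids = [6, 8, 10, 9]  # "tokenization is so fun", tokenized as above
decoded = [id_to_word[token_id] for token_id in token_ids]

print(" ".join(decoded))  # tokenization is <|UNK|> fun
```

#### Notice that "so" does not come back. Any word outside the vocabulary is replaced by 10, so the original word cannot be recovered from the tokenized text.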



## Dataloading:

#### Remember, LLMs are pretrained by trying to predict the next word in a sequence of words.

<div style="max-width:800px">
    
![](images/interactive_2.gif)

</div>

#### Let's implement this in code. First, let's look at your sentence and its tokenized version.

In [19]:
print(YOUR_TEXT)
print("\n",YOUR_TEXT_TOKENIZED)

['tokenization', 'is', 'so', 'fun']

 [6, 8, 10, 9]


#### Next, let's create the input and target pairs. We want the elements of the inputs array and the targets array to correspond to each other position by position. That is, the first element in the targets array is the word we want the LLM to predict when fed the first word in the sequence. The second element in the targets array is the word we want the LLM to predict when fed the second word in the sequence, and so on.

In [22]:
inputs = []
targets = []

for i in range(len(YOUR_TEXT) - 1):
    inputs.append(YOUR_TEXT[i])
    targets.append(YOUR_TEXT[i+1])

#### Let's analyze what this code does. The for loop iterates through the words in the sequence, but it ends before it reaches the last word. Why?

#### Because the LLM is supposed to predict the next word in the sequence, given a word. If we give it the last word in the sequence, there is nothing left to predict. Therefore, we stop at the second-to-last word.
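
#### The same shift-by-one pairing can also be written with list slices and applied to the token IDs instead of the words. This is a minimal sketch, not part of the original notebook: `token_ids`, `input_ids`, and `target_ids` are hypothetical names, and the IDs are the tokenized sentence from earlier.

```python
token_ids = [6, 8, 10, 9]  # "tokenization is so fun", tokenized as above

# Inputs: every token except the last, since nothing follows it to predict.
input_ids = token_ids[:-1]
# Targets: every token except the first, i.e. the inputs shifted by one.
target_ids = token_ids[1:]

# Each input token lines up with the token that comes right after it.
for input_id, target_id in zip(input_ids, target_ids):
    print(f"{input_id} -> {target_id}")
```

#### Both lists have three elements, one fewer than the sentence, for the same reason the loop above stops early.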

#### Let's see what is in our inputs and targets arrays:

In [24]:
print("Input words:", inputs)
print("Target words:", targets)

Input words: ['tokenization', 'is', 'so']
Target words: ['is', 'so', 'fun']


#### As you can see, it's just like I described above.

#### The first element in inputs is the first word in the sequence. The first element in targets is the second word in the sequence.
#### The second element