### Stage #1 - Building an LLM

- #### Step #1 - Data preparation and sampling

In [1]:
import os
import re

folder_name = 'llm-from-scratch'
file_name = 'the-verdict.txt'
print(f"{folder_name = }\n{file_name = }\n")

# Read the text/book.
with open(file=os.path.join(os.getcwd(), folder_name, file_name), mode='r', encoding='utf-8') as f:
    text_raw = f.read()

# Print the first 1000 characters of the text.
n_chars = 1000
print(f"Total number of characters in '{file_name}': {len(text_raw)}\n")
print(f"First {n_chars} characters of '{file_name}':\n\n{text_raw[:n_chars]}\n")

folder_name = 'llm-from-scratch'
file_name = 'the-verdict.txt'

Total number of characters in 'the-verdict.txt': 20479

First 1000 characters of 'the-verdict.txt':

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the l

##### __Difference between `pattern=r'(\s)'` and `pattern=r'\s+'` when used with `re.split`:__

__`pattern=r'\s+'`__
- __Meaning__: Matches one or more consecutive whitespace characters (spaces, tabs, newlines, etc.).
- __Behavior in__ `re.split`: Splits the string at every sequence of whitespace, and does not include the whitespace in the result.
- __Example__:
    ```python
    import re
    text = "Hello   world!\nHow are you?"
    print(re.split(r'\s+', text))
    # Output: ['Hello', 'world!', 'How', 'are', 'you?']
    ```

__`pattern=r'(\s)'`__
- __Meaning__: Matches a single whitespace character, and the parentheses create a capturing group.
- __Behavior in__ `re.split`: Splits the string at every single whitespace character, and includes the matched whitespace characters in the result as separate elements.
- __Example__:
    ```python
    import re
    text = "Hello   world!\nHow are you?"
    print(re.split(r'(\s)', text))
    # Output: ['Hello', ' ', '', ' ', '', ' ', 'world!', '\n', 'How', ' ', 'are', ' ', 'you?']
    ```

__Summary Table:__

| Pattern   | Splits on              | Includes whitespace in result? | Example Output                                      |
|-----------|------------------------|--------------------------------|-----------------------------------------------------|
| `r'\s+'`  | Any run of whitespace  | No                             | `['Hello', 'world!', 'How', 'are', 'you?']`         |
| `r'(\s)'` | Each whitespace char   | Yes (as separate elements)     | `['Hello', ' ', '', ' ', '', ' ', ...]`             |

__In short:__
- Use `r'\s+'` to split and discard whitespace.
- Use `r'(\s)'` to split and keep each whitespace character in the result.
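
Because the capturing group keeps every separator, the `r'(\s)'` split is lossless: joining the pieces reproduces the input exactly, which the `r'\s+'` split cannot do. A minimal check, reusing the example string from above:

```python
import re

text = "Hello   world!\nHow are you?"

# Capturing split: every whitespace character is kept as its own element.
kept = re.split(r'(\s)', text)
# Non-capturing split: runs of whitespace are discarded.
dropped = re.split(r'\s+', text)

# Joining the capturing split restores the original string exactly.
print(''.join(kept) == text)
# Re-joining with single spaces loses the run of spaces and the newline.
print(' '.join(dropped) == text)
```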


In [2]:
# Create a sample of text.
n_chars = 1000
n_tokens = 40
text_sample = text_raw[:n_chars]
print(f"Sample of {n_chars} characters from '{file_name}':\n\n{text_sample}\n")

# Split on runs of whitespace (\s+); the whitespace itself is discarded.
pattern = r'\s+'
# Split the text into tokens.
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n") # Print the first 'n_tokens' tokens.

# Split on each single whitespace character (\s); the capturing group keeps it in the result.
pattern = r'(\s)'
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")

# Split on runs of whitespace (\s+), commas and periods ([,.]); the separators are discarded.
pattern = r'\s+|[,.]'
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")

# Split on runs of whitespace (\s+), commas and periods ([,.]); the capturing group keeps the separators in the result.
pattern = r'(\s+|[,.])'
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")

Sample of 1000 characters from 'the-verdict.txt':

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not

##### __`pattern = r'\s+|([,.:;?_!"()\'-]|--)'`__

With this pattern, `re.split` returns `None` values because only the punctuation alternative sits inside the capturing group.

When a pattern contains a capturing group, `re.split` adds the group's text to the result at every split. If the split happens on the `\s+` alternative (whitespace), the group takes no part in the match, so `re.split` inserts `None` at that position.

__How to Fix?__
After splitting, filter out `None` values from your token list.

```python
import re

text = "Hello   world! How are you? -- I'm fine."
pattern = r'\s+|([,.:;?_!\"()\'-]|--)'
tokens = re.split(pattern, text)
# Remove None and empty strings
tokens = [tok for tok in tokens if tok not in (None, '')]
print(tokens)
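# Output: ['Hello', 'world', '!', 'How', 'are', 'you', '?', '-', '-', 'I', "'", 'm', 'fine', '.']
# '--' comes out as two '-' tokens and "I'm" is split at the apostrophe: both '-' and "'"
# are in the character class, which is tried before the '--' alternative.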
```
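
The `--` alternative in this pattern is never used: the character class already contains `-` and is tried first, so a double dash is emitted as two `'-'` tokens (this is visible in the output of the next cell). One possible variation, shown here only as a sketch and not used by the cells below, lists `--` before the character class so that it becomes a single token. The names `split_tokens`, `pattern_original` and `pattern_reordered` are introduced here for illustration.

```python
import re

def split_tokens(pattern, text):
    """Split text with the pattern and drop None and empty strings."""
    return [tok for tok in re.split(pattern, text) if tok not in (None, '')]

text = "genius--though a good fellow enough--so it was"

# Pattern as used in this notebook: '-' in the character class wins over '--'.
pattern_original = r'\s+|([,.:;?_!"()\'-]|--)'
# Hypothetical variation: try '--' before the single characters.
pattern_reordered = r'\s+|(--|[,.:;?_!"()\'-])'

tokens_original = split_tokens(pattern_original, text)
tokens_reordered = split_tokens(pattern_reordered, text)

print(tokens_original)
print(tokens_reordered)
```

Switching the notebook to the reordered pattern would change the token counts and vocabulary ids recorded in the outputs that follow, so the cells keep the original pattern.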

In [3]:
# Split at 'white-space' character (\s) (exclude them) and other special characters, like commas, period, etc. (include them).
pattern = r'\s+|([,.:;?_!"()\'-]|--)'
text_tokens = re.split(pattern=pattern, string=text_raw)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")

# Remove None and empty strings.
text_tokens = [token for token in text_tokens if token not in (None, '')]
print(f"Number of tokens in the sample (after cleaning): {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample (after cleaning):\n{text_tokens[:n_tokens]}\n")

Pattern: r'\s+|([,.:;?_!"()\'-]|--)'
Number of tokens in the sample: 9341
First 40 tokens in the sample:
['I', None, 'HAD', None, 'always', None, 'thought', None, 'Jack', None, 'Gisburn', None, 'rather', None, 'a', None, 'cheap', None, 'genius', '-', '', '-', 'though', None, 'a', None, 'good', None, 'fellow', None, 'enough', '-', '', '-', 'so', None, 'it', None, 'was', None]

Number of tokens in the sample (after cleaning): 4863
First 40 tokens in the sample (after cleaning):
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '-', '-', 'though', 'a', 'good', 'fellow', 'enough', '-', '-', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had']



In [4]:
# Get unique tokens for the text.
n_token_ids = 20
unique_tokens = sorted(set(text_tokens))
print(f"Number of unique tokens in 'unique_tokens': {len(unique_tokens)}\n")
print(f"First {n_token_ids} unique tokens in 'unique_tokens':\n{unique_tokens[:n_token_ids]}\n")

# Create a dictionary of token ids.
vocab_token_ids = {
    token: token_id
    for token_id, token in enumerate(unique_tokens)
}
print(f"First {n_token_ids} items in 'vocab_token_ids':\n{list(vocab_token_ids.items())[:n_token_ids]}\n")

# Create a reverse dictionary of token ids.
reverse_vocab_token_ids = {
    token_id: token
    for token, token_id in vocab_token_ids.items()
}
print(f"First {n_token_ids} items in 'reverse_vocab_token_ids':\n{list(reverse_vocab_token_ids.items())[:n_token_ids]}\n")

Number of unique tokens in 'unique_tokens': 1139

First 20 unique tokens in 'unique_tokens':
['!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', 'A', 'Ah', 'Among', 'And', 'Are', 'Arrt', 'As', 'At', 'Be']

First 20 items in 'vocab_token_ids':
[('!', 0), ('"', 1), ("'", 2), ('(', 3), (')', 4), (',', 5), ('-', 6), ('.', 7), (':', 8), (';', 9), ('?', 10), ('A', 11), ('Ah', 12), ('Among', 13), ('And', 14), ('Are', 15), ('Arrt', 16), ('As', 17), ('At', 18), ('Be', 19)]

First 20 items in 'reverse_vocab_token_ids':
[(0, '!'), (1, '"'), (2, "'"), (3, '('), (4, ')'), (5, ','), (6, '-'), (7, '.'), (8, ':'), (9, ';'), (10, '?'), (11, 'A'), (12, 'Ah'), (13, 'Among'), (14, 'And'), (15, 'Are'), (16, 'Arrt'), (17, 'As'), (18, 'At'), (19, 'Be')]



Explanation for 
```python
text = re.sub(r'\s+([,.:;?_!"()\'-]|--)', r'\1', text)
```

- Pattern: `r'\s+([,.:;?_!"()\'-]|--)'`

    - `\s+` matches one or more whitespace characters (spaces, tabs, newlines, etc.).
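    - `([,.:;?_!"()\'-]|--)` is a capturing group that matches one of the listed punctuation marks. The `--` alternative is never reached, because the character class already contains `-` and is tried first; a double dash is therefore handled one `-` at a time.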

- Replacement: `r'\1'`
    - `\1` refers to the text matched by the capturing group (the punctuation).

- What it does:
    - __Finds__: Any whitespace that comes immediately before one of the listed punctuation marks.
    - __Replaces__: The whitespace and the punctuation with just the punctuation (removes the whitespace before punctuation).

- Example:
    ```python
    text = "Hello , world ! How are you ? -- I'm fine ."
    text = re.sub(r'\s+([,.:;?_!"()\'-]|--)', r'\1', text)
    print(text)
    # Hello, world! How are you?-- I'm fine.
    ```

- Summary:

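    This line removes any whitespace that appears before the listed punctuation, so each mark is attached to the preceding token. This is useful for detokenizing text after punctuation has been split into separate tokens. The result is not always identical to the original text: opening quotes and parentheses are also pulled onto the previous word, and the space that follows an apostrophe or a double dash is kept (for example `I' ve`).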
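
The substitution only removes whitespace *before* punctuation, which can be checked on a small case. The sketch below joins a hand-written token list (an assumption for illustration, not output taken from the tokeniser) with spaces and applies the same substitution; the space after the opening quote and after the apostrophe survives.

```python
import re

# Hand-written tokens, in the shape the splitting pattern would produce.
tokens = ['"', 'My', 'dear', ',', 'since', 'I', "'", 've', 'chucked', 'painting', '.']

# Join with single spaces, then remove whitespace that precedes punctuation.
joined = ' '.join(tokens)
restored = re.sub(r'\s+([,.:;?_!"()\'-]|--)', r'\1', joined)

print(joined)
print(restored)
```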

In [5]:
# Create a tokeniser class - version1.
class SimpleTokeniserV1:
    def __init__(self, vocab, pattern=r'\s+|([,.:;?_!"()\'-]|--)'):
        self.vocab = vocab
        self.reverse_vocab = {value: key for key, value in vocab.items()}
        self.pattern = pattern
    
    def encoder(self, text):
        tokens = re.split(pattern=self.pattern, string=text)
        tokens = [token for token in tokens if token not in (None, '')]

        token_ids = [self.vocab[token] for token in tokens]

        return token_ids
        
    def decoder(self, token_ids):
        text = ' '.join([self.reverse_vocab[token_id] for token_id in token_ids])
        # _tokens = [self.reverse_vocab[_token_id] for _token_id in token_ids]
        # _text = ' '.join(_tokens) # Join tokens with a space.
        
        # Remove white-space before special characters.
        text = re.sub(''.join(self.pattern.split('|', maxsplit=1)), r'\1', text)

        return text
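
`SimpleTokeniserV1.encoder` looks every token up directly in `vocab`, so a word that never occurred in the source text raises a `KeyError`. The sketch below reproduces that lookup with a tiny hypothetical vocabulary (`toy_vocab` is made up here and is not the notebook's `vocab_token_ids`) to show the failure that an unknown-token entry is meant to avoid.

```python
import re

pattern = r'\s+|([,.:;?_!"()\'-]|--)'
# Hypothetical four-entry vocabulary, for illustration only.
toy_vocab = {',': 0, '.': 1, 'the': 2, 'verdict': 3}

def encode(text, vocab):
    """Same lookup as SimpleTokeniserV1.encoder: fails on unseen tokens."""
    tokens = [tok for tok in re.split(pattern, text) if tok not in (None, '')]
    return [vocab[tok] for tok in tokens]

known_ids = encode("the verdict.", toy_vocab)
print(known_ids)

missing = None
try:
    encode("the world.", toy_vocab)
except KeyError as err:
    # The KeyError carries the token that was not found.
    missing = err.args[0]
    print(f"Token not in vocabulary: {missing!r}")
```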

In [6]:
# Test 'SimpleTokeniserV1' class.
# Create a sample of text.
text_sample = """
"My dear, since I've chucked painting people don't say that stuff about me--they say it about Victor Grindle," was his only protest, as he rose from the table and strolled out onto the sunlit terrace.
"""

# Create a tokeniser object.
tokeniser = SimpleTokeniserV1(vocab=vocab_token_ids)

# Encode the text.
text_encoded = tokeniser.encoder(text=text_sample)
print(f"Encoded text:\n{text_encoded}\n")

# Decode the text.
text_decoded = tokeniser.decoder(token_ids=text_encoded)
print(f"Decoded text:\n{text_decoded}\n")

Encoded text:
[1, 68, 325, 5, 903, 53, 2, 1071, 263, 752, 768, 365, 2, 979, 864, 996, 954, 118, 666, 6, 6, 1003, 864, 588, 118, 105, 42, 5, 1, 1087, 551, 735, 807, 5, 175, 535, 854, 483, 997, 980, 157, 949, 742, 736, 997, 965, 992, 7]

Decoded text:
" My dear, since I' ve chucked painting people don' t say that stuff about me-- they say it about Victor Grindle," was his only protest, as he rose from the table and strolled out onto the sunlit terrace.



In [8]:
n_token_ids = 10

# Create unique tokens and add tokens for unknown ('<|unk|>') and end-of-text (<|endoftext|>).
unique_tokens_all = sorted(set(text_tokens))
print(f"Number of unique tokens in 'unique_tokens_all': {len(unique_tokens_all)}\n")
unique_tokens_all.extend(['<|unk|>', '<|endoftext|>'])
print(f"Number of unique tokens in 'unique_tokens_all' afterwards: {len(unique_tokens_all)}\n")

# Create a dictionary of token ids.
vocab_token_ids_all = {
    token: token_id
    for token_id, token in enumerate(unique_tokens_all)
}
print(f"Last {n_token_ids} items in 'vocab_token_ids_all':\n{list(vocab_token_ids_all.items())[-n_token_ids:]}\n")

# Create a reverse dictionary of token ids.
reverse_vocab_token_ids_all = {
    token_id: token
    for token, token_id in vocab_token_ids_all.items()
}
print(f"Last {n_token_ids} items in 'reverse_vocab_token_ids_all':\n{list(reverse_vocab_token_ids_all.items())[-n_token_ids:]}\n")

Number of unique tokens in 'unique_tokens_all': 1139

Number of unique tokens in 'unique_tokens_all' afterwards: 1141

Last 10 items in 'vocab_token_ids_all':
[('year', 1131), ('years', 1132), ('yellow', 1133), ('yet', 1134), ('you', 1135), ('younger', 1136), ('your', 1137), ('yourself', 1138), ('<|unk|>', 1139), ('<|endoftext|>', 1140)]

Last 10 items in 'reverse_vocab_token_ids_all':
[(1131, 'year'), (1132, 'years'), (1133, 'yellow'), (1134, 'yet'), (1135, 'you'), (1136, 'younger'), (1137, 'your'), (1138, 'yourself'), (1139, '<|unk|>'), (1140, '<|endoftext|>')]



In [9]:
# Create updated tokeniser class - version2.
class SimpleTokeniserV2:
    def __init__(self, vocab, pattern=r'\s+|([,.:;?_!"()\'-]|--)'):
        self.vocab = vocab
        self.reverse_vocab = {value: key for key, value in vocab.items()}
        self.pattern = pattern
    
    def encoder(self, text):
        tokens = re.split(pattern=self.pattern, string=text)
        tokens = [token for token in tokens if token not in (None, '')]

        # Replace out-of-vocabulary tokens with the '<|unk|>' token.
        tokens = [
            token if token in self.vocab else '<|unk|>' 
            for token in tokens
        ]
        token_ids = [self.vocab[token] for token in tokens]

        return token_ids
        
    def decoder(self, token_ids):
        text = ' '.join([self.reverse_vocab[token_id] for token_id in token_ids])
        
        # Remove white-space before special characters.
        text = re.sub(''.join(self.pattern.split('|', maxsplit=1)), r'\1', text)

        return text

In [10]:
# Test 'SimpleTokeniserV2' class.
# Create a sample of text.
text_01 = """Hello, world...!!!"""
text_02 = """Welcome to the world of LLMs."""

# Create 'end-of-text' token.
text_sample = ' <|endoftext|> '.join((text_01, text_02))
print(f"Text sample with 'end-of-text' token:\n{text_sample}\n")

# Create a tokeniser object.
tokeniser = SimpleTokeniserV2(vocab=vocab_token_ids_all)

# Encode the text.
text_encoded = tokeniser.encoder(text=text_sample)
print(f"Encoded text:\n{text_encoded}\n")

# Decode the text.
text_decoded = tokeniser.decoder(token_ids=text_encoded)
print(f"Decoded text:\n{text_decoded}")

Text sample with 'end-of-text' token:
Hello, world...!!! <|endoftext|> Welcome to the world of LLMs.

Encoded text:
[1139, 5, 1139, 7, 7, 7, 0, 0, 0, 1140, 1139, 1025, 997, 1139, 726, 1139, 7]

Decoded text:
<|unk|>, <|unk|>...!!! <|endoftext|> <|unk|> to the <|unk|> of <|unk|>.


Tokeniser - BPE (byte pair encoding).
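
As a rough illustration of the idea behind byte pair encoding, the sketch below repeatedly merges the most frequent adjacent pair of symbols in a toy word list. It is a simplified, character-level toy: the helper names `most_frequent_pair` and `merge_pair` and the word list are made up here. It is not how `tiktoken` is implemented; GPT-2's tokeniser works on bytes and applies a merge table learned in advance.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair that occurs most often across all words."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of the pair in one word with a merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Toy corpus: each word starts as a list of single characters.
words = [list("lower"), list("lowest"), list("low")]

for step in range(2):
    pair = most_frequent_pair(words)
    words = [merge_pair(symbols, pair) for symbols in words]
    print(f"merge {step + 1}: {pair} -> {words}")
```

After two merges the shared prefix has become the single symbol `low`, which is how frequent character sequences end up as one token while rare strings fall back to smaller pieces.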

In [11]:
# uv add tiktoken
import importlib.metadata  # 'import importlib' alone does not guarantee that the 'metadata' submodule is loaded.
import tiktoken

print(f"{importlib.metadata.version('tiktoken') = }\n")

?tiktoken.get_encoding

importlib.metadata.version('tiktoken') = '0.9.0'



Signature: tiktoken.get_encoding(encoding_name: 'str') -> 'Encoding'
Docstring: <no docstring>
File:      ~/Study/github/python-examples/llm-from-scratch/.venv/lib/python3.11/site-packages/tiktoken/registry.py
Type:      function

Example

In [12]:
# Create a tokeniser instance with 'tiktoken'.
tokeniser = tiktoken.get_encoding(encoding_name='gpt2')

In [13]:
# Define a sample text.
text_sample = """
Hello, world...!!! Welcome to the world of LLMs.
"""
# Encode the text.
text_encoded = tokeniser.encode(text=text_sample, allowed_special='all')
print(f"Encoded text with 'line-break':\n{text_encoded}\n")

text_sample = """
Hello, world...!!! <|endoftext|> Welcome to the world of LLMs <|endoftext|>.
"""
# Encode the text.
text_encoded = tokeniser.encode(text=text_sample, allowed_special='all')
print(f"Encoded text with 'line-break' and 'end-of-text':\n{text_encoded}\n")

text_sample = """Hello, world...!!! <|endoftext|> Welcome to the world of LLMs <|endoftext|>."""
# Encode the text.
text_encoded = tokeniser.encode(text=text_sample, allowed_special='all')
print(f"Encoded text without 'line-break' and with 'end-of-text':\n{text_encoded}")

Encoded text with 'line-break':
[198, 15496, 11, 995, 986, 10185, 19134, 284, 262, 995, 286, 27140, 10128, 13, 198]

Encoded text with 'line-break' and 'end-of-text':
[198, 15496, 11, 995, 986, 10185, 220, 50256, 19134, 284, 262, 995, 286, 27140, 10128, 220, 50256, 13, 198]

Encoded text without 'line-break' and with 'end-of-text':
[15496, 11, 995, 986, 10185, 220, 50256, 19134, 284, 262, 995, 286, 27140, 10128, 220, 50256, 13]


In [14]:
# Define a sample text.
text_sample = """
Hello, world...!!! <|endoftext|> Welcome to the world of LLMs.
"""

# Encode the text.
text_encoded = tokeniser.encode(text=text_sample, allowed_special='all')
print(f"Encoded text:\n{text_encoded}\n")

# Decode the text.
text_decoded = tokeniser.decode(tokens=text_encoded)
print(f"Decoded text:\n{text_decoded}")

Encoded text:
[198, 15496, 11, 995, 986, 10185, 220, 50256, 19134, 284, 262, 995, 286, 27140, 10128, 13, 198]

Decoded text:

Hello, world...!!! <|endoftext|> Welcome to the world of LLMs.



In [15]:
# Define random text.
text_sample = """
hsakhgkk 798796 ^%$jvja ":>>L)(*)
"""

# Encode the text.
text_encoded = tokeniser.encode(text=text_sample, allowed_special='all')
print(f"Encoded 'random' text:\n{text_encoded}\n")

# Decode the text.
text_decoded = tokeniser.decode(tokens=text_encoded)
print(f"Decoded 'random' text:\n{text_decoded}")

Encoded 'random' text:
[198, 11994, 11322, 70, 28747, 767, 4089, 41060, 10563, 4, 3, 73, 85, 6592, 366, 25, 4211, 43, 5769, 28104, 198]

Decoded 'random' text:

hsakhgkk 798796 ^%$jvja ":>>L)(*)



Create `input-target` pairs.

In [16]:
print(text_raw[:500])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it'


In [20]:
text_encoded = tokeniser.encode(text=text_raw, allowed_special='all')
print(f"Sample of encoded text:\n{text_encoded[:100]}\n")
print(f"Number of tokens in the text: {len(text_encoded)}\n")

print(f"{'=='*50}\n")
print(f"Sample of decoded text:\n{tokeniser.decode(tokens=text_encoded[:100])}\n")
print(f"Decoded text for tokens 50 to 99:\n{tokeniser.decode(tokens=text_encoded[50:100])}")

Sample of encoded text:
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11, 290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686, 41976, 13, 357, 10915, 314, 2138, 1807, 340, 561, 423, 587, 10598, 393, 28537, 2014, 198, 198, 1, 464, 6001, 286, 465, 13476, 1, 438, 5562, 373, 644, 262, 1466, 1444, 340, 13, 314, 460, 3285, 9074, 13, 46606, 536]

Number of tokens in the text: 5145


Sample of decoded text:
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the w

In [22]:
context_size = 4 # length of the input.

x = text_encoded[:context_size]
y = text_encoded[1:context_size + 1]

print(f"'x': {x}")
print(f"'y': \t{y}\n")

print(f"{'=='*50}\n")
for i in range(1, context_size + 1):
    x = text_encoded[:i]
    y = text_encoded[i]
    print(f"{x} ---> {y}")

print(f"{'=='*50}\n")
for i in range(1, context_size + 1):
    x = tokeniser.decode(tokens=text_encoded[:i])
    y = tokeniser.decode(tokens=[text_encoded[i]])
    print(f"{x} ---> {y}")


'x': [40, 367, 2885, 1464]
'y': 	[367, 2885, 1464, 1807]


[40] ---> 367
[40, 367] ---> 2885
[40, 367, 2885] ---> 1464
[40, 367, 2885, 1464] ---> 1807

I --->  H
I H ---> AD
I HAD --->  always
I HAD always --->  thought


Implement a `DataLoader`

In [23]:
# uv add torch torchvision
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, text, tokeniser, context_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenizes the entire text.
        token_ids = tokeniser.encode(text=text, allowed_special='all')

        # [i for i in range(0, (45 - 4), 4)]
        # [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40]

        # Uses a sliding window to chunk the text into sequences of 'context_length' tokens (they overlap when stride < context_length).
        for i in range(0, len(token_ids) - context_length, stride):
            input_chunk = token_ids[i:i + context_length]
            target_chunk = token_ids[i + 1:i + context_length + 1]
            self.input_ids.append(input_chunk)
            self.target_ids.append(target_chunk)

    # Returns the total number of rows in the dataset.
    def __len__(self):
        return len(self.input_ids)
    
    # Returns a single row from the dataset.
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
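
The effect of `stride` on the sliding window can be checked without PyTorch. The helper below (`sliding_windows`, a name introduced here for illustration) mirrors the loop in `GPTDatasetV1.__init__` on a short list of made-up token ids.

```python
def sliding_windows(token_ids, context_length, stride):
    """Return (input, target) chunks the same way the GPTDatasetV1 loop does."""
    pairs = []
    for i in range(0, len(token_ids) - context_length, stride):
        input_chunk = token_ids[i:i + context_length]
        # The target is the input shifted one token to the right.
        target_chunk = token_ids[i + 1:i + context_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10))  # made-up ids 0..9

# stride == context_length: consecutive inputs do not overlap.
no_overlap = sliding_windows(token_ids, context_length=4, stride=4)
# stride == 1: each input is the previous one shifted by a single token.
full_overlap = sliding_windows(token_ids, context_length=4, stride=1)

for inputs, targets in no_overlap:
    print(inputs, "--->", targets)
print(len(no_overlap), len(full_overlap))
```

A smaller stride produces more, heavily overlapping rows; a stride equal to `context_length` uses each token as an input at most once.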

In [None]:
# help(DataLoader)

In [25]:
def create_dataloader_v1(
        text, batch_size=4, context_length=256,
        stride=128, shuffle=True, num_workers=0
):
    # Initializes the tokenizer.
    tokeniser = tiktoken.get_encoding(encoding_name='gpt2')

    # Creates dataset.
    dataset = GPTDatasetV1(
        text=text, tokeniser=tokeniser,
        context_length=context_length, stride=stride
    )

    # Creates dataloader.
    dataloader = DataLoader(
        dataset=dataset, # dataset from which to load the data.
        batch_size=batch_size, # number of samples per batch to load.
        shuffle=shuffle,
        num_workers=num_workers, # number of subprocesses (CPU processes) to use for data loading.
        drop_last=True # drop the last incomplete batch, if the dataset size is not divisible by the batch size.
    )

    return dataloader

In [26]:
import torch

# Check PyTorch version.
print(f"PyTorch version: {torch.__version__}\n")

# Read the text/book.
with open(file=os.path.join(os.getcwd(), folder_name, file_name), mode='r', encoding='utf-8') as f:
    text_raw = f.read()

# Load the text into a DataLoader.
dataloader = create_dataloader_v1(
    text=text_raw, batch_size=1, context_length=4,
    stride=1, shuffle=False, num_workers=0
)

# Converts dataloader into a Python iterator to fetch the next entry via Python’s built-in next() function.
data_iter = iter(dataloader)
# list(data_iter)

# Fetch the first entry from the dataloader.
x, y = next(data_iter)

# Print the first batch.
print(f"First batch of input ids (x):\n{x}\n")
print(f"First batch of target ids (y):\n{y}\n")

# Fetch the second entry from the dataloader.
x, y = next(data_iter)

# Print the second batch.
print(f"Second batch of input ids (x):\n{x}\n")
print(f"Second batch of target ids (y):\n{y}\n")

PyTorch version: 2.7.0

First batch of input ids (x):
[tensor([40]), tensor([367]), tensor([2885]), tensor([1464])]

First batch of target ids (y):
[tensor([367]), tensor([2885]), tensor([1464]), tensor([1807])]

Second batch of input ids (x):
[tensor([367]), tensor([2885]), tensor([1464]), tensor([1807])]

Second batch of target ids (y):
[tensor([2885]), tensor([1464]), tensor([1807]), tensor([3619])]



In [27]:
# Define batch size.
batch_size, stride = 4, 1
print(f"{batch_size = }, {stride = }\n")
print(f"{'=='*50}\n")

dataloader = create_dataloader_v1(
    text=text_raw, batch_size=batch_size, context_length=4,
    stride=stride, shuffle=False, num_workers=0
)
data_iter = iter(dataloader)

# Print the first batch.
x, y = next(data_iter)
print(f"First batch of input ids (x):\n{x}\n")
print(f"First batch of target ids (y):\n{y}\n")

# Print the second batch.
x, y = next(data_iter)
print(f"Second batch of input ids (x):\n{x}\n")
print(f"Second batch of target ids (y):\n{y}\n")

batch_size = 4, stride = 1


First batch of input ids (x):
[tensor([  40,  367, 2885, 1464]), tensor([ 367, 2885, 1464, 1807]), tensor([2885, 1464, 1807, 3619]), tensor([1464, 1807, 3619,  402])]

First batch of target ids (y):
[tensor([ 367, 2885, 1464, 1807]), tensor([2885, 1464, 1807, 3619]), tensor([1464, 1807, 3619,  402]), tensor([1807, 3619,  402,  271])]

Second batch of input ids (x):
[tensor([1807, 3619,  402,  271]), tensor([ 3619,   402,   271, 10899]), tensor([  402,   271, 10899,  2138]), tensor([  271, 10899,  2138,   257])]

Second batch of target ids (y):
[tensor([ 3619,   402,   271, 10899]), tensor([  402,   271, 10899,  2138]), tensor([  271, 10899,  2138,   257]), tensor([10899,  2138,   257,  7026])]
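
The batches above are Python lists of tensors rather than a single 2-D tensor because `GPTDatasetV1` stores each chunk as a plain list. For samples that are sequences of Python numbers, the default collate step groups values by position across the batch, roughly like `zip(*batch)`. With `stride=1` the regrouped values happen to look the same as the samples themselves, so the sketch below uses two non-overlapping chunks (the first eight token ids from the output above) to make the regrouping visible. It is a plain-Python approximation, not PyTorch's actual collate code.

```python
# Two input chunks as GPTDatasetV1 would store them with context_length=4, stride=4.
batch = [
    [40, 367, 2885, 1464],
    [1807, 3619, 402, 271],
]

# Approximation of position-wise collation: one group per token position.
by_position = [list(group) for group in zip(*batch)]

print(by_position)
```

Converting each chunk with `torch.tensor(...)` before appending it in `__init__` would instead let the `DataLoader` stack the samples into one tensor of shape `(batch_size, context_length)`; the outputs recorded above were produced without that change.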



Token/Vector embedding.

In [28]:
# uv add gensim
# from gensim.models import Word2Vec
import gensim.downloader as api

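# return_path=True returns the path to the downloaded model file instead of loading the model.
model_path = api.load(name='word2vec-google-news-300', return_path=True)
print(f"Model path: {model_path}\n")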

Model path: /Users/shaz/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz



- #### Step #2 - Attention mechanism

- #### Step #3 - LLM architecture

### Stage #2 - Foundation model

- #### Step #1 - Training Loop

- #### Step #2 - Model evaluation

- #### Step #3 - Load pretrained weights

### Stage #3 - 

- #### Step #1 - `TBC`

- #### Step #2 - `TBC`