### Stage #1 - Building an LLM

- #### Step #1 - Data preparation and sampling

In [14]:
# Import libraries
import os
import re
# import sys

In [13]:
folder_name = 'llm-from-scratch'
file_name = 'the-verdict.txt'
print(f"{folder_name = }\n{file_name = }\n")

# Read the text/book.
with open(file=os.path.join(os.getcwd(), folder_name, file_name), mode='r', encoding='utf-8') as f:
    text_raw = f.read()

# Print the first 1000 characters of the text.
n_chars = 1000
print(f"Total number of characters in '{file_name}': {len(text_raw)}\n")
print(f"First {n_chars} characters of '{file_name}':\n\n{text_raw[:n_chars]}\n")

folder_name = 'llm-from-scratch'
file_name = 'the-verdict.txt'

Total number of characters in 'the-verdict.txt': 20479

First 1000 characters of 'the-verdict.txt':

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the l

##### __Difference between `pattern=r'(\s)'` and `pattern=r'\s+'` when used with `re.split`:__

`pattern=r'\s+'`
- __Meaning__: Matches one or more consecutive whitespace characters (spaces, tabs, newlines, etc.).
- __Behavior in__ `re.split`: Splits the string at every sequence of whitespace, and does not include the whitespace in the result.
- __Example__:
    ```python
    import re
    text = "Hello   world!\nHow are you?"
    print(re.split(r'\s+', text))
    # Output: ['Hello', 'world!', 'How', 'are', 'you?']
    ```

`pattern=r'(\s)'`
- __Meaning__: Matches a single whitespace character, and the parentheses create a capturing group.
- __Behavior in__ `re.split`: Splits the string at every single whitespace character and includes each matched character in the result as a separate element. A run of consecutive whitespace therefore produces empty strings (`''`) between the captured characters.
- __Example__:
    ```python
    import re
    text = "Hello   world!\nHow are you?"
    print(re.split(r'(\s)', text))
    # Output: ['Hello', ' ', '', ' ', '', ' ', 'world!', '\n', 'How', ' ', 'are', ' ', 'you?']
    ```

__Summary Table:__

| Pattern   | Splits on              | Includes whitespace in result? | Example Output                                      |
|-----------|------------------------|--------------------------------|-----------------------------------------------------|
| `r'\s+'`  | Any run of whitespace  | No                             | `['Hello', 'world!', 'How', 'are', 'you?']`         |
| `r'(\s)'` | Each whitespace char   | Yes (as separate elements)     | `['Hello', ' ', '', ' ', '', ' ', ...]`             |

__In short:__
- Use `r'\s+'` to split and discard whitespace.
- Use `r'(\s)'` to split and keep each whitespace character in the result.
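
A third variant, `r'(\s+)'`, combines the two ideas: it captures each run of whitespace as a single element. The sketch below reuses the example string from above and also shows that, with a capturing group, joining the pieces restores the original text. The variable names `pieces` and `words` are illustrative only.

```python
import re

text = "Hello   world!\nHow are you?"

# Capturing a whole run keeps each block of whitespace as one element.
pieces = re.split(r'(\s+)', text)
print(pieces)
# Output: ['Hello', '   ', 'world!', '\n', 'How', ' ', 'are', ' ', 'you?']

# With a capturing group nothing is lost, so joining restores the input.
print(''.join(pieces) == text)
# Output: True

# Dropping empty and whitespace-only pieces leaves just the words.
words = [p for p in re.split(r'(\s)', text) if p.strip()]
print(words)
# Output: ['Hello', 'world!', 'How', 'are', 'you?']
```

Filtering with `p.strip()` is one simple way to discard the `''` and whitespace elements when they are not needed as tokens.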


In [None]:
# Create a sample of text.
n_chars = 1000
n_tokens = 40
text_sample = text_raw[:n_chars]
print(f"Sample of {n_chars} characters from '{file_name}':\n\n{text_sample}\n")

# Split on runs of whitespace (\s+); the whitespace is discarded.
pattern = r'\s+'
# Split the text into tokens.
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")  # Print the first n_tokens tokens.

# Split on each whitespace character (\s) and keep it in the result (capturing group).
pattern = r'(\s)'
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")

# Split on runs of whitespace (\s+) or on a comma or period ([,.]); the delimiters are discarded.
pattern = r'\s+|[,.]'
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")

# Split on runs of whitespace (\s+) or on a comma or period ([,.]); the delimiters are kept (capturing group).
pattern = r'(\s+|[,.])'
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")

Sample of 1000 characters from 'the-verdict.txt':

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not
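
The effect of adding `[,.]` to the pattern is easier to see on a short string than on the 1000-character sample. A minimal sketch, where `demo` is a made-up string and not text from the book:

```python
import re

demo = "Hello, world. Test"  # hypothetical string, not from the book

# Without a capturing group the delimiters are discarded.
print(re.split(r'\s+|[,.]', demo))
# Output: ['Hello', '', 'world', '', 'Test']

# With a capturing group the delimiters are returned as tokens.
print(re.split(r'(\s+|[,.])', demo))
# Output: ['Hello', ',', '', ' ', 'world', '.', '', ' ', 'Test']
```

The empty strings appear wherever two delimiters are adjacent, for example a comma immediately followed by a space: `re.split` returns the (empty) text between them.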

In [37]:
# Split on runs of whitespace (\s+, discarded) and on punctuation such as commas and periods (kept via the capturing group).
pattern = r'\s+|([,.:;?_!"()\'-]|--)'
text_tokens = re.split(pattern=pattern, string=text_sample)
print(f"Pattern: r'{pattern}'")
print(f"Number of tokens in the sample: {len(text_tokens)}")
print(f"First {n_tokens} tokens in the sample:\n{text_tokens[:n_tokens]}\n")

Pattern: r'\s+|([,.:;?_!"()\'-]|--)'
Number of tokens in the sample: 465
First 40 tokens in the sample:
['I', None, 'HAD', None, 'always', None, 'thought', None, 'Jack', None, 'Gisburn', None, 'rather', None, 'a', None, 'cheap', None, 'genius', '-', '', '-', 'though', None, 'a', None, 'good', None, 'fellow', None, 'enough', '-', '', '-', 'so', None, 'it', None, 'was', None]
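
The `None` entries and the `'-', '', '-'` sequences in this output follow from two properties of the pattern. First, `\s+` sits outside the capturing group, so whenever the split happens on whitespace the group does not take part in the match and `re.split` reports it as `None`. Second, alternatives are tried left to right, so the single `-` in the character class matches before `--` is ever attempted, and a double hyphen is split into two `-` tokens with an empty string between them.

One possible adjustment is sketched below; it is an assumption about the intended result, not the notebook's final tokenizer. It tries `--` first, moves the whitespace alternative inside the group, and filters out empty and whitespace-only pieces. The names `sample` and `tokenize` are hypothetical.

```python
import re

# Hypothetical sample string, chosen to contain '--', a comma, and quotes.
sample = 'genius--though a good fellow, "so" it was.'

# Same delimiters as the cell above, but '--' is tried first and the
# whitespace alternative sits inside the capturing group.
pattern = r'(--|[,.:;?_!"()\'-]|\s+)'

def tokenize(text, pattern=pattern):
    """Split on the pattern, then drop empty and whitespace-only pieces."""
    return [piece for piece in re.split(pattern, text) if piece.strip()]

print(tokenize(sample))
# Output: ['genius', '--', 'though', 'a', 'good', 'fellow', ',', '"', 'so', '"', 'it', 'was', '.']
```

Because the single group now takes part in every match, no `None` values are produced, and the list comprehension only has to remove `''` and whitespace. A single `-` is still treated as a delimiter, so a hyphenated word such as `Moon-dancers` becomes three tokens.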



- #### Step #2 - Attention mechanism

- #### Step #3 - LLM architecture

### Stage #2 - Foundation model

- #### Step #1 - Training Loop

- #### Step #2 - Model evaluation

- #### Step #3 - Load pretrained weights

### Stage #3 - 

- #### Step #1 - `TBC`

- #### Step #2 - `TBC`