## Dataloaders

This notebook shows how raw text is converted into vectors that represent it. It follows this notebook: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/dataloader.ipynb

In [None]:
import os
import re
import tiktoken
import torch
import urllib.request

## Tokenizing text

In this section, we tokenize text, which means breaking it into smaller units such as individual words and punctuation characters.

In [None]:
# Download the raw text if it is not already present locally
file_path = "the-verdict.txt"
if not os.path.exists(file_path):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    urllib.request.urlretrieve(url, file_path)

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer on a short sample text first, then apply it to the text above later
- The following regular expression splits on whitespace

In [None]:
text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)
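
The parentheses in `(\s)` form a capturing group, which is why `re.split` keeps the whitespace separators in the result instead of discarding them. The short comparison below is only an illustrative sketch; the variable names `without_group` and `with_group` are not part of the original notebook.

```python
import re

text = "Hello, world. This, is a test."

# Without a capturing group, the separators are dropped from the result
without_group = re.split(r'\s', text)

# With a capturing group, each matched separator is kept as its own item
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)
```

Keeping the separators matters once punctuation is added to the pattern, because the punctuation marks should survive as tokens of their own.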

We want to split not only on whitespace but also on commas and periods, so let's modify the regular expression accordingly.

In [None]:
result = re.split(r'([,.]|\s)', text)

print(result)

This creates empty strings; let's remove them.

In [None]:
# Keep only items that contain non-whitespace characters
# (this drops both empty strings and whitespace-only items).
result = [item for item in result if item.strip()]
print(result)
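
The split and filter steps can be combined into one small helper. This is a minimal sketch, assuming a hypothetical function name `split_and_filter` and a `pattern` parameter that are not part of the original notebook. Note that the filter also removes the whitespace items themselves, so the original spacing cannot be reconstructed from the tokens.

```python
import re

def split_and_filter(text, pattern=r'([,.]|\s)'):
    """Split text on the pattern, keeping delimiters, then drop
    empty and whitespace-only items. (Hypothetical helper.)"""
    pieces = re.split(pattern, text)
    return [piece for piece in pieces if piece.strip()]

sample = "Hello, world. This, is a test."
tokens = split_and_filter(sample)

print(tokens)
print("Number of tokens:", len(tokens))
```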

This looks pretty good, but let's also handle other types of punctuation, such as question marks, exclamation marks, and so on.