#### Chapter 2: Working with Text
###### Packages used in this notebook:

In [None]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

###### This chapter covers data preparation and sampling to get input data "ready" for the LLM

#### 2.1 Understanding word embeddings
###### No code in this section
###### There are many forms of embeddings; we focus on text embeddings in this book
###### LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions)
###### Since we can't visualize such high-dimensional spaces (we humans think in 1, 2, or 3 dimensions), the figure below illustrates a 2-dimensional embedding space

#### 2.2 Tokenizing text
###### In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters
###### Load raw text we want to work with
###### The Verdict by Edith Wharton is a public domain short story

In [None]:
import os
import urllib.request

file_path = "the-verdict.txt"

# Download the short story only if it is not already on disk
if not os.path.exists(file_path):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    urllib.request.urlretrieve(url, file_path)

###### (If you encounter an ssl.SSLCertVerificationError when executing the previous code cell, it might be due to an outdated Python version; more information is available on GitHub)

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

###### The goal is to tokenize and embed this text for an LLM
###### Let's develop a simple tokenizer on a short sample text, which we can later apply to the text above
###### The following regular expression splits on whitespace characters

In [None]:
import re

text = "Hello, world. This is a test."
result = re.split(r'(\s)', text)

print(result)
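
###### The parentheses in the pattern form a capturing group, which is why re.split returns the whitespace separators as separate list items instead of discarding them
###### The following minimal sketch compares both variants on the same sample text; the names sample, without_group, and with_group are only illustrative

```python
import re

sample = "Hello, world. This is a test."  # same sample text as above

# Without a capturing group, the whitespace separators are discarded
without_group = re.split(r'\s', sample)

# With a capturing group, each separator is kept as its own list item
with_group = re.split(r'(\s)', sample)

print(without_group)
# ['Hello,', 'world.', 'This', 'is', 'a', 'test.']
print(with_group)
# ['Hello,', ' ', 'world.', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test.']

# Because nothing is dropped, joining the pieces restores the original text
print("".join(with_group) == sample)
# True
```

###### Keeping the separators means no information is lost at this stage; whether to keep or drop whitespace can be decided later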

###### We want to split not only on whitespace but also on commas and periods, so let's modify the regular expression accordingly

In [None]:
result = re.split(r'([,.]|\s)', text)

print(result)
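
###### With this pattern, re.split returns an empty string wherever two delimiters are directly adjacent (for example, a comma followed by a space) or where a delimiter ends the text
###### The sketch below locates these empty strings in the output for the same sample text; the names sample, pieces, and empty_positions are only illustrative

```python
import re

sample = "Hello, world. This is a test."  # same sample text as above
pieces = re.split(r'([,.]|\s)', sample)

# Collect the list indices at which re.split produced an empty string
empty_positions = [i for i, piece in enumerate(pieces) if piece == ""]

print(empty_positions)
# [2, 6, 16]

# Show each empty string together with its neighbors
for i in empty_positions:
    print(pieces[max(i - 1, 0):i + 2])
# [',', '', ' ']
# ['.', '', ' ']
# ['.', '']
```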

###### As we can see, this creates empty strings; let's remove them