# Chapter 1 - Input to LLMs




## Introduction

LLMs take text as input and produce text as output. However, deep learning networks cannot operate directly on text symbols, so the text must first be represented in a continuous space. In this chapter, we will look at how to pre-process text input so that a large language model can consume it.

Figure 1 shows the basic blocks of this operation.
(reference:preprocessing)=
```{figure} ../../images/chapter1/input_creation.png
---
height: 150px
name: preprocessing
---
Input preprocessing blocks
```

Before we go into the nuances of each of these steps,
let us outline what happens in each block using a no-frills example.

(reference:nofrill)=
### No-frills example

A causal large language model, also referred to as an autoregressive model, is trained to predict the next word. Given a sequence of words, the causal model predicts the most probable next word from a vocabulary. Causal models are trained on a collection of documents. Documents are composed of words, and each word is composed of characters. In our example, the word is the smallest unit we operate on. We use the term corpus to refer to these input documents. The smallest units in the corpus, words in our case, are called tokens. The set of unique tokens in the corpus is called the vocabulary.

Let us take a sample paragraph, our corpus for this exercise, and apply a regular expression to split the paragraph by whitespace or special characters. The resultant list of words forms the tokens for this corpus.

In [3]:
import re

# /* Text borrowed from https://www.tntech.edu/cas/physics/aboutphys/about-physics.php */
text_corpus = "Broadly, physics involves the study of everything in physical existence," + \
"from the smallest subatomic particles to the entire universe. Physicists try " + \
"to develop conceptual and mathematical models that describe interactions between entities " + \
"(both big and small) and that can be used to extend our understanding of how the "+ \
"universe works at different scales. Are you interested in studying physics?"


split_expr = r'([?.#$&*^@,)(]|\s)'
tokens = re.split(split_expr, text_corpus)
tokens = [token.strip() for token in tokens if len(token.strip()) > 0]
print(tokens)

['Broadly', ',', 'physics', 'involves', 'the', 'study', 'of', 'everything', 'in', 'physical', 'existence', ',', 'from', 'the', 'smallest', 'subatomic', 'particles', 'to', 'the', 'entire', 'universe', '.', 'Physicists', 'try', 'to', 'develop', 'conceptual', 'and', 'mathematical', 'models', 'that', 'describe', 'interactions', 'between', 'entities', '(', 'both', 'big', 'and', 'small', ')', 'and', 'that', 'can', 'be', 'used', 'to', 'extend', 'our', 'understanding', 'of', 'how', 'the', 'universe', 'works', 'at', 'different', 'scales', '.', 'Are', 'you', 'interested', 'in', 'studying', 'physics', '?']


:::{note}
In the above example we have an `r` before the string. This tells the Python interpreter to treat the string as a raw string, so a
backslash is kept as a literal character and not interpreted as an escape character.
:::
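The effect of the `r` prefix is easy to check directly. The following is a minimal sketch; the variable names are chosen only for this illustration.

```python
# A regular string literal interprets "\n" as one newline character,
# while a raw string keeps the backslash and the letter n as two characters.
regular = "\n"
raw = r"\n"
print(len(regular), len(raw))

# For the regex engine, r"\s" and "\\s" are the same two-character pattern.
same_pattern = r"\s" == "\\s"
print(same_pattern)
```

Raw strings simply save us from doubling every backslash in a regular expression.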

The set of unique tokens forms our vocabulary. Further we assign a unique id for each token.

In [4]:
vocabulary = {token:token_id for token_id, token in enumerate(set(tokens))}
print(vocabulary)

{'and': 0, 'interested': 1, 'to': 2, 'models': 3, 'small': 4, 'at': 5, 'scales': 6, 'universe': 7, ',': 8, 'entities': 9, 'subatomic': 10, 'that': 11, 'interactions': 12, 'be': 13, 'between': 14, 'particles': 15, 'our': 16, 'used': 17, 'can': 18, '.': 19, 'conceptual': 20, 'of': 21, 'involves': 22, 'everything': 23, 'entire': 24, 'smallest': 25, 'big': 26, 'understanding': 27, 'works': 28, 'mathematical': 29, 'studying': 30, 'Physicists': 31, 'the': 32, 'physical': 33, '(': 34, 'different': 35, 'study': 36, ')': 37, 'describe': 38, 'how': 39, 'try': 40, 'physics': 41, 'develop': 42, 'you': 43, 'existence': 44, '?': 45, 'Broadly': 46, 'from': 47, 'both': 48, 'in': 49, 'extend': 50, 'Are': 51}


With this vocabulary we can now encode any input string into a list of integers / token ids.

:::{note}
The size of the vocabulary plays a large part in building an LLM. The challenge is to keep the vocabulary compact and
still cover as many of the tokens in the corpus as possible. Later in this chapter we will discuss
how to build a compact vocabulary. The illustration given here is a very
simplified example.
:::

In [5]:
input_text = "universe works at different scales"

tokens = re.split(split_expr, input_text)
tokens = [token.strip() for token in tokens if len(token.strip()) > 0]
encoded_input = [vocabulary[token.strip()] for token in tokens]
print(encoded_input)

[7, 28, 5, 35, 6]
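Going the other way, from token ids back to text, only needs the inverse mapping. The sketch below uses a small hypothetical vocabulary (`toy_vocabulary`, with ids chosen for illustration) instead of the one built above, whose ids depend on the iteration order of a Python set. The helper names `encode` and `decode` are likewise assumptions made for this example.

```python
# Toy vocabulary (hypothetical ids, chosen only for this sketch).
toy_vocabulary = {"universe": 0, "works": 1, "at": 2, "different": 3, "scales": 4}

# Invert the token -> id mapping to get an id -> token mapping.
id_to_token = {token_id: token for token, token_id in toy_vocabulary.items()}

def encode(text):
    """Split on whitespace and map each token to its id."""
    return [toy_vocabulary[token] for token in text.split()]

def decode(token_ids):
    """Map ids back to tokens and join them with single spaces."""
    return " ".join(id_to_token[token_id] for token_id in token_ids)

ids = encode("universe works at different scales")
print(ids)
print(decode(ids))
```

Joining with single spaces is a simplification: it would not restore the original spacing around punctuation.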


We have successfully converted our word tokens into token ids. Although the input is now numeric, the ids are arbitrary integers that carry no notion of similarity between words, so a neural network cannot learn from them directly. We need the input in a continuous space. This is where word embeddings come in handy. Let us build an embedding lookup table. The keys of this lookup table are our integer token ids. The values are continuous vector representations.


In [6]:
import numpy as np

vocab_size = len(vocabulary)  # ids run from 0 to 51, so the table needs 52 rows
print(f"vocabulary size {vocab_size}")

embedding_size = 5
word_embedding = np.random.uniform(size=(vocab_size, embedding_size))
print(f"Word Embedding shape {word_embedding.shape}")
print(word_embedding[0:5,0:5])

vocabulary size 52
Word Embedding shape (52, 5)
[[0.79061896 0.23025439 0.55998829 0.8374295  0.40650912]
 [0.28080375 0.6848087  0.3396529  0.17248223 0.29623326]
 [0.07165954 0.76513322 0.47325477 0.7382186  0.76346754]
 [0.90756471 0.14178423 0.65096936 0.23639402 0.67450719]
 [0.37767153 0.42207631 0.68086117 0.59773189 0.78474626]]


Here we build an embedding lookup table with the embedding dimension set to 5. Each row of the table corresponds to a token id and
the columns hold the embedding for that token. The embeddings are random real numbers representing the words in a continuous space; in a real model they are learned during training.

In [7]:
input_embedding = word_embedding[encoded_input,:]
print(input_embedding.shape)
print(input_embedding[0:5, 0:5])

(5, 5)
[[0.85432497 0.33774429 0.5163072  0.16599111 0.77619878]
 [0.8367271  0.02581914 0.05084087 0.15658079 0.86089508]
 [0.25121756 0.13697472 0.44487938 0.13113087 0.51111093]
 [0.33662536 0.6617542  0.16918677 0.71635738 0.5714265 ]
 [0.51201298 0.95930028 0.74252266 0.45792642 0.87873272]]


Word order carries information: the same words in a different order can mean something different. In addition to the words themselves, providing the position of each word
is beneficial to the model. Similar to the word embedding, we will create a lookup table for the position embedding.
Let us assume a simple case here. The input size of our LLM is fixed, say 10. We will call this the sequence length.


In [8]:
sequence_length = 10
position_embedding_lookup = np.random.uniform(size=(sequence_length, embedding_size))

position_index =  np.arange(input_embedding.shape[0])
position_embedding = position_embedding_lookup[position_index, :]

position_embedding[0:2, 0:5]

array([[0.5235984 , 0.60912799, 0.66000757, 0.87457871, 0.50215997],
       [0.28703682, 0.38816409, 0.02410332, 0.85221138, 0.70862048]])

The position embedding has the same size as the word embedding, so the two can be added element-wise.



In [9]:
final_embedding = input_embedding + position_embedding

As with any deep learning model, feature values X and label values Y are fed into the large language model. The main job of a causal model is to predict the next word given the preceding words.


Let us see how we can quickly prepare the input X and the label Y for our LLM.


In [15]:
tokens = re.split(split_expr, text_corpus)
tokens = [token.strip() for token in tokens if len(token.strip()) > 0]
token_encoding = [vocabulary[token] for token in tokens]

feature_batch = []
label_batch = []

slide = 1
for idx in range(len(tokens) - sequence_length ) :
    feature = token_encoding[idx:idx + sequence_length]
    label =   token_encoding[idx + slide: idx + slide + sequence_length]

    feature_batch.append(feature)
    label_batch.append(label)

print(f"a feature : {feature_batch[0]}")
print(f"a label   : {label_batch[0]}")


a feature : [46, 8, 41, 22, 32, 36, 21, 23, 49, 33]
a label   : [8, 41, 22, 32, 36, 21, 23, 49, 33, 44]


Given the token id 46, we want the LLM to predict 8; given 46 and 8, we want it to predict 41; and so on. Shifting the feature one position to the right gives the token ids for the labels.


:::{note}
Sliding is a design decision. For demonstration purposes we have used a slide of 1. This may lead to overfitting in some cases.
:::
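Two separate quantities are involved in the windowing above: the shift between a feature and its label, which is one token for next-word prediction, and the step between the starts of consecutive windows. The sketch below separates the two. The function name `make_windows` and its `stride` parameter are assumptions introduced for this illustration, not part of the code above.

```python
def make_windows(token_ids, sequence_length, stride=1):
    """Return (features, labels); each label is its feature shifted right by one token."""
    features, labels = [], []
    # The last valid start leaves room for the label's final token.
    last_start = len(token_ids) - sequence_length - 1
    for start in range(0, last_start + 1, stride):
        features.append(token_ids[start:start + sequence_length])
        labels.append(token_ids[start + 1:start + 1 + sequence_length])
    return features, labels

toy_ids = list(range(10))  # stand-in for real token ids
features, labels = make_windows(toy_ids, sequence_length=4, stride=2)
for feature, label in zip(features, labels):
    print(feature, "->", label)
```

A larger stride produces fewer, less overlapping training windows from the same corpus.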

With these we can obtain the embeddings through the lookup tables we have created.



Hopefully this gives a summary of all the steps involved in preparing the input for an LLM.

:::{note}
The examples in this chapter are deliberately trivial. The goal is to understand the data pipeline. Real-world pipelines are far more complex. To quote from the Llama 3 description,

"Llama 3 is pretrained on over 15T tokens that were all collected from publicly available sources. Our training dataset is seven times larger than that used for Llama 2, and it includes four times more code. To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages. "

Writing a pipeline to process fifteen trillion tokens is the work of a large team of data engineers with sophisticated hardware.

"To ensure Llama 3 is trained on data of the highest quality, we developed a series of data-filtering pipelines. These pipelines include using heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality. We found that previous generations of Llama are surprisingly good at identifying high-quality data, hence we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3."

Further, to ensure good-quality training data, a lot of pre-processing work needs to be performed. Covering all of it is outside the scope of this work. However, we touch upon some of the essential pre-processing topics.

https://ai.meta.com/blog/meta-llama-3/

:::

## A Sample Text corpus

Publicly and privately available LLMs leverage text data available on the World Wide Web for pre-training. In the no-frills section, we showed how the features and labels needed to train an LLM come from the same source: shifting the features by one token gives us the labels. This can be done in an unsupervised manner, saving the labor needed to create a large labeled training dataset. In the GPT-1 paper {cite}`radford2018improving`, the authors call this training process unsupervised pre-training. GPT-1 was trained on the BooksCorpus dataset {cite}`zhu2015aligning`.


Loading input text from disparate sources is a tedious undertaking. LLMs are trained on terabytes of data. Complex data pipelines are orchestrated to
extract and validate the data. Details of those pipelines are beyond the scope of this book. Here is a quote from {cite}`touvron2023llama`: "Our training corpus includes a new mix of data from publicly available sources, which does not include data
from Meta’s products or services. We made an effort to remove data from certain sites known to contain a
high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this
provides a good performance–cost trade-off, up-sampling the most factual sources in an effort to increase
knowledge and dampen hallucinations."  


    
To give an idea of how to load a corpus, we will use SimpleBooks {cite}`nguyen2019simplebooks`. After downloading the dataset, we will show how to leverage Hugging Face's `datasets` library to load it.

In [335]:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

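def download_simplebooks(destination: str) -> None:
    """
    Download the simplebooks archive and extract it under `destination`.
    """
    url = "https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip"
    # Read the whole archive into memory, then unzip it.
    with urlopen(url) as http_response:
        zipfile = ZipFile(BytesIO(http_response.read()))
    zipfile.extractall(path=destination)
    print("Finished downloading and extracting simplebooks.zip")


# Assumed destination: the archive unpacks into a simplebooks/ folder,
# which matches the ../data/simplebooks paths used below.
download_simplebooks("../data")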

Finished downloading and extracting simplebooks.zip


### Structure of simplebooks

The SimpleBooks dataset was created from 1,573 books selected from Project Gutenberg, mostly children's books.
When downloaded, SimpleBooks comes in two sizes. Simplebooks-2 is 11MB in size with a vocabulary of 11,492 tokens, and Simplebooks-92
is roughly 400MB with a vocabulary of 98,304 tokens. Simplebooks-2 has 2.2M tokens. Compared to Llama 2, which was trained on 2 trillion tokens, SimpleBooks
is a small dataset that is convenient for writing code to study LLMs.

    !ls ../data/simplebooks
    README.md  simplebooks-2  simplebooks-2-raw  simplebooks-92  simplebooks-92-raw

Both simplebooks-2 and simplebooks-92 have companion folders with a raw suffix. The raw-suffixed folders hold the data unchanged from the Gutenberg source. The following normalization steps were applied to the raw data, and the results are stored in the folders without the raw suffix.

1. Spacy was used to tokenize each book. Original case and punctuation were preserved.
2. @ was added as a separator inside numbers, so 300,000 becomes 300 @,@ 000.
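The number convention in the second step can be imitated with a single regular expression. This is only a sketch of the described behaviour, not the script used to build the dataset. The function name is hypothetical, and treating the period the same way as the comma is an assumption.

```python
import re

def separate_number_punctuation(text):
    """Wrap ',' or '.' with '@' when it sits between two digits."""
    # Lookbehind and lookahead make sure a digit is on both sides.
    return re.sub(r"(?<=\d)([,.])(?=\d)", r" @\1@ ", text)

print(separate_number_punctuation("The city has 300,000 people."))
```

Punctuation that is not surrounded by digits, such as the final period of the sentence, is left untouched.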

Each of the folders has train, test and validation splits and a vocabulary file.

    !ls ../data/simplebooks/simplebooks-2
    test.txt  train.txt  train.vocab  valid.txt
    
simplebooks-2 and simplebooks-92 have the cleaned up data. The vocabulary built after applying pre-tokenization on the normalized text is also stored. A quick peek at the train files should show the difference.


In [130]:
!head -n 10 ../../data/simplebooks/simplebooks-2/train.txt


More <unk> Tales

By

Ellen C. <unk>



I





In [24]:
!head -n 15 ../data/simplebooks/simplebooks-2-raw/train.txt


More Jataka Tales

By

Ellen C. Babbitt



I

The Girl Monkey And The String Of Pearls


One day the king went for a long walk in the woods. When he came back to his own garden, he sent for his family to come down to the lake for a swim.



As a part of pre-tokenization, some of the unknown words such as Jataka and Babbitt are replaced by the token "<unk>". More about special tokens later in this chapter. Unnecessary white space is also removed; for example,
"own garden , " is cleaned up to "own garden,". Let us peek into the vocabulary created.

In [44]:
vocabulary = {}
with open("../data/simplebooks/simplebooks-2/train.vocab") as f:
    rows = ( line.split('\t') for line in f )
    count = 0
    for row in rows:
        vocabulary[row[0]] = int(row[1].strip())
        count+=1
        
print(f"Entries in vocabulary {count}")
print(f"sample tokens {list(vocabulary.keys())[0:5]}")
print(f"their frequencies {list(vocabulary.values())[0:5]}")

Entries in vocabulary 11493
sample tokens [',', '.', 'the', '"', 'and']
their frequencies [131695, 105703, 98932, 97156, 63612]
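The numbers in the second column are far larger than the vocabulary size, so they read as occurrence counts rather than token ids (an assumption based on the values printed above). If that is the case, ids still have to be assigned. One common choice is to rank tokens by count, as in this sketch with three made-up rows in the same tab-separated layout.

```python
import io

# Three made-up rows in the "token<TAB>count" layout described above.
sample_vocab_file = io.StringIO("the\t120\n,\t95\nmonkey\t7\n")

token_counts = {}
for line in sample_vocab_file:
    token, count = line.rstrip("\n").split("\t")
    token_counts[token] = int(count)

# Assign ids by descending count, so frequent tokens get small ids.
ranked = sorted(token_counts, key=token_counts.get, reverse=True)
token_to_id = {token: token_id for token_id, token in enumerate(ranked)}
print(token_to_id)
```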


Hopefully this gives an idea about input text corpus.

## Tokenization Pipeline

Tokenization begins with the raw input text (the corpus) and ends with a dictionary of tokens and their associated token ids. Token ids are integers. Once this dictionary is built, the pipeline should be able to turn any new text into its token ids. Similarly, given a list of token ids, the pipeline should be able to convert it back to text without any loss. The figure below illustrates the various steps involved in this pipeline.

(reference:tokenization)=
```{figure} ../../images/chapter1/TokenEncoding.jpg
---
height: 250px
name: encoding
---
Steps in Tokenization
```

### Normalization

In the simplebooks example, we saw that unnecessary whitespace was removed and numbers were formatted by inserting '@' around their separators.
Typical normalization involves removing unnecessary whitespace, stripping accents, lowercasing and similar operations. Here is a list of some normalizers provided by the [HuggingFace Tokenizer library](https://huggingface.co/docs/tokenizers/en/components).

1. Unicode normalization (NFD, NFKD, NFC and NFKC algorithms)
2. Lowercase conversion
3. Stripping white spaces and accents
4. Replacing common string patterns

:::{admonition} Unicode normalization

Unicode encoding involves assigning a numerical value called "code point" to each character and transforming them into a series of bytes.
Issues may arise when a character can be represented by a single code point or a combination of two code points. Unicode normalization
is the process of normalizing a unicode encoded string into a canonical form.

For the more curious, this [article](https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/) gives a little history of ASCII, Latin-1 and Unicode.

:::
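The single-code-point versus two-code-point issue can be reproduced with Python's standard `unicodedata` module. The sketch below is only an illustration of the idea and is not how any particular tokenizer library implements its normalizers.

```python
import unicodedata

composed = "\u00f3"        # the accented o as a single code point
decomposed = "o\u0301"     # plain o followed by a combining acute accent

print(composed == decomposed)  # the raw strings differ
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
print(nfc_equal, nfd_equal)

# Stripping accents: decompose, then drop the combining marks.
stripped = "".join(
    ch for ch in unicodedata.normalize("NFD", "L\u00f3pez")
    if not unicodedata.combining(ch)
)
print(stripped)
```

After normalizing both strings to the same form, a tokenizer sees one spelling of the word instead of two.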

Quoting from the GPT-1 paper {cite}`radford2018improving`: "We use the ftfy library to clean the raw text in BooksCorpus, standardize some punctuation and whitespace, and use the spaCy tokenizer".

1. [ftfy - fixes text for you](https://ftfy.readthedocs.io/en/latest/index.html)
2. [Spacy](https://spacy.io/)

Going into the details of ftfy and Spacy is beyond the scope of this book. The following code snippets demonstrate the basic usage of these packages. We will discuss Spacy in the pre-tokenization section.

In [20]:
import ftfy

ftfy.fix_text("L&AMP;ATILDE;&AMP;SUP3;PEZ")

'LóPEZ'

The string "L&AMP;ATILDE;&AMP;SUP3;PEZ" is converted to LóPEZ by ftfy. The package takes care of such character decoding issues. Let us look
at a Spacy example. After installing Spacy, download the tokenizer model to run the following code snippet.

    conda install ftfy spacy
    python -m spacy download en_core_web_sm

`````{admonition} Mojibake (文字化け, "Garbled") 
:class: tip
Garbled text that results from decoding with a character encoding other than the one the text was originally encoded with.
Here is a funny poem about Mojibake, inspired by the characters printed on a shipping label.

(reference:mojibake)=
Figure 2 A funny mojibake poem

```{figure} ../images/chapter1/shipping-label.png
---
height: 150px
name: shipping-label
---
Mojibake shipping label
```
ODE TO A SHIPPING LABEL
Once there was a little o,
with an accent on top like so

It started out as UTF8,
but the program only knew latin1,
and changed the little o to A for fun.

and it goes on. The complete poem is available [here](https://imgur.com/4J7Il0m).

The text on the label should read López; wrong decoding turned it into Mojibake.

`````

### Pre-tokenization


Using a set of rules, the text is split into atomic units called tokens. Imagine this as the superset of tokens fed into the vocabulary building exercise. A subset of these tokens makes its way into the final vocabulary. An example pre-tokenizer is a simple whitespace tokenizer: if two words are separated by whitespace, they are treated as two tokens.
We saw an example of this in the no-frills section. Let us write some Python code to implement what we have learnt.

In [2]:
import re
import ftfy
import spacy

class SpacyTokenizer():
    """
    Tokenizer based on Spacy library
    """
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")

    def __call__(self, input_text):

        assert len(input_text) > 0

        doc = self.nlp(input_text)
        tokens = [token.text for token in doc]
        tokens = [token.strip() for token in tokens if len(token.strip()) > 0]

        return tokens
        


class RegexTokenizer():
    """
    Regex Based Tokenizer
    Splits text by either whitespace or by one of these
    special characters,?.#$&*^@,
    """
    def __call__(self, input_text):

        assert input_text is not None
        assert len(input_text) > 0

        tokenizer_regex = r'([?.#$&*^@,)(]|\s)'
        tokens = re.split(tokenizer_regex, input_text)
        tokens = [token.strip() for token in tokens if len(token.strip()) > 0]

        return tokens
        



The regex-based tokenizer uses the regular expression we introduced in the no-frills section. The Spacy tokenizer uses
the Spacy library to tokenize. Let us take a sample from our Simplebooks dataset to see these tokenizers in action.

In [1]:
def read_simplebooks(path):
    # Yield one line at a time and close the file when done.
    with open(path, 'r') as f:
        yield from f

simplebooks_reader = read_simplebooks('../data/simplebooks/simplebooks-2-raw/train.txt')
SAMPLE_SIZE = 6

tokenizer = RegexTokenizer()


simple_books_sample = [tokenizer(line) for idx, line in enumerate(simplebooks_reader) if idx <= SAMPLE_SIZE and len(line) > 1]
simple_books_sample


With this small sample of tokenized lines, we are ready to look at the next step of the pipeline.

### Tokenizer models - Dictionary Training

One may wonder why any further processing is needed in the tokenization pipeline. The pre-tokenization output can be used directly to build a vocabulary: the set of unique tokens gathered after running the tokenizer over the input corpus is the vocabulary. The catch is size. The Transformer-XL model {cite}`dai2019transformerxl` has a vocabulary of around 250K tokens, compared to Llama, which has a vocabulary of 32K tokens.

:::{note}
TransformerXL uses space and punctuation to tokenize the text. Their vocabulary size is around 250K. Here is the link
to their paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
::: 


A compact vocabulary reduces the model complexity and the computation needed for training and inference. Special tokens and their token ids are added for unknown words. If an input to the language model contains a word not present in the vocabulary, it is treated as unknown and the token id assigned to the unknown token is substituted. A good token encoding pipeline should strive to reduce the number of unknown words. A compact vocabulary and few unknown words are two opposing constraints.

Word based tokenization suffers from very large vocabulary size and large number of out of vocabulary tokens.
Character based tokenization suffers from very large sequences and less meaningful individual tokens.
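
The trade-off between the two extremes shows up even on a single short sentence. The sketch below, with a sentence made up for this illustration, compares sequence length and vocabulary size for word-level and character-level tokenization.

```python
sentence = "the lowest and the newest"

word_tokens = sentence.split()   # word-level: split on whitespace
char_tokens = list(sentence)     # character-level: one token per character

print(f"word level: {len(word_tokens)} tokens, {len(set(word_tokens))} unique")
print(f"char level: {len(char_tokens)} tokens, {len(set(char_tokens))} unique")
```

The character-level sequence is several times longer, while its vocabulary stays bounded by the alphabet no matter how large the corpus grows. The word-level vocabulary keeps growing with every new word.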

The pre-tokenization leaves us with a superset of all the tokens. The Dictionary training phase involves applying an algorithm to finalize the subset of tokens from this superset to be used for encoding.


```{note}
Tokenizers that use recursive rules to produce a vocabulary are commonly called sub-word tokenizers.
```

Do not confuse this with the training process in machine learning. Here, training means applying a set of rules to produce an optimal dictionary.
Using these rules, the tokens are split further to form a compact vocabulary while reducing the chances of producing unknown token ids.

The most commonly used dictionary training approaches are

1. BPE - Byte Pair Encoding
2. WordPiece
3. SentencePiece
4. Unigram




```{admonition} Character Level Encoding
The two biggest challenges with word-level tokenization are the size of the vocabulary and the number of unknown tokens produced during encoding.
The vocabulary has to be very large to decrease the number of unknown tokens, and even then a large reduction is not guaranteed. Words are made of characters, so how about we tokenize the individual characters and use an encoding for each character?

    input_corpus_encoded = [ord(character) for character in text_corpus]
    print(input_corpus_encoded)
    assert len(text_corpus) == len(input_corpus_encoded)
    
    [76, 97, 114, 103, 101, 32, 108, 97, 110, 103, 117, 97, 103, 101, 32, 109, 111, 100, 101, 108, 115, 44, 32, 116, 104, 101, 32, 110, 101, 119, 32, 107, 105, 100, 32, 105, 110, 32, 116, 104, 101, 32, 98, 108, 111, 99, 107, 32, 105, 115, 32, 99, 114, 101, 97, 116, 105, 110, 103, 32, 119, 111, 110, 100, 101, 114, 115, 46, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 78, 117, 109, 101, 114, 111, 117, 115, 32, 97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 115, 32, 104, 97, 118, 101, 32, 115, 112, 97, 110, 110, 101, 100, 32, 105, 110, 32, 116, 104, 101, 32, 108, 97, 115, 116, 32, 116, 119, 111, 32, 121, 101, 97, 114, 115, 32, 108, 101, 118, 97, 114, 97, 103, 105, 110, 103, 32, 108, 108, 109, 115, 46]

    decoded_input = [chr(token_id) for token_id in input_corpus_encoded]
    print("".join(decoded_input))
    
    Large language models, the new kid in the block is creating wonders.                 Numerous applications have spanned in the last two years levaraging llms.

**Challenges with character level encoding**

The context of the words is lost with character-level encoding. It may be suitable for small toy LLMs but is unusable for building systems of any practical use.

```


In this book we will cover Byte pair encoding. Curious readers can go through https://huggingface.co/docs/transformers/en/tokenizer_summary to get a summary of other subword algorithms.



#### Byte pair encoding

Byte pair encoding was first introduced for word segmentation in the paper Neural Machine Translation of Rare Words with Subword Units {cite}`sennrich2016neural`. It is a sub-word level method. The original algorithm is attributed to Philip Gage (1994), A New Algorithm for Data Compression, C Users J., 12(2):23–38, February. It is a data compression algorithm that works iteratively. Say, for example, we have the following string

aaabdaaabac

Iteratively, let us now replace the most frequent pair with a symbol not present in the string. For example, we replace the pair 'aa' with Z. The new
string will be ZabdZabac. 'Za' is now one of the most frequent pairs (it ties with 'ab', and we pick the first one). Let us replace it with X. We continue this way until the string reaches the desired size.
Below is the Python code to demonstrate this iteration.

In [45]:
from collections import Counter

def bpe_compression(input_str, desired_size=5):
    iterations = 0
    replace = ['Z','X','Y','L']
    replacements = []
    while len(input_str) > desired_size:
        iterations+=1
        pairs = Counter([i + j for i, j in zip(input_str, input_str[1:])])
        pair,freq = pairs.most_common(1)[0]
        if freq <=1 :
            break
        old_str = input_str
        input_str = input_str.replace(pair, replace[iterations - 1])
        replacements.append((replace[iterations -1], pair))
        print(f"Iteration {iterations} \n old {old_str} \n new {input_str} \n {pair} replaced by {replace[iterations-1]}")
    return input_str, replacements

input_str, replacements = bpe_compression('aaabdaaabac')
        

Iteration 1 
 old aaabdaaabac 
 new ZabdZabac 
 aa replaced by Z
Iteration 2 
 old ZabdZabac 
 new XbdXbac 
 Za replaced by X
Iteration 3 
 old XbdXbac 
 new YdYac 
 Xb replaced by Y


In order to reconstruct this, we store all the replacements in a stack.

In [46]:
print(replacements)

[('Z', 'aa'), ('X', 'Za'), ('Y', 'Xb')]


Now we can pop the replacements off the stack and recover the original string.

In [48]:
while len(replacements) > 0:
    replace_str, _str = replacements.pop()
    print(replace_str, _str)
    input_str = input_str.replace(replace_str, _str)
    print(input_str)

BPE begins with the output of the pre-tokenizer. Each token is mapped to its constituent characters followed by an end-of-word symbol, together with the token's frequency in the input corpus.

Let us see the example from {cite}`sennrich2016neural`.



In [3]:
vocab  = {'l o w </w>' : 5, 'l o w e r </w>' : 2, 'n e w e s t </w>':6, 'w i d e s t </w>':3}

So given a token 'l o w </w>', we get the list of adjacent symbol pairs and their frequencies. In this case the pairs are

'l o', 'o w' and 'w </w>'. Since 'l o' occurs in 'l o w' (frequency 5) and 'l o w e r' (frequency 2), its frequency is 7.

In [4]:
from collections import defaultdict

def get_freq_pairs():
    pairs = defaultdict(int)
    for token,frequency in vocab.items():
        symbols = token.split()
        for i in range(len(symbols) -1):
            pair = (symbols[i],symbols[i+1])
            pairs[pair]+=frequency
    return pairs

pairs = get_freq_pairs()
pairs

defaultdict(int,
            {('l', 'o'): 7,
             ('o', 'w'): 7,
             ('w', '</w>'): 5,
             ('w', 'e'): 8,
             ('e', 'r'): 2,
             ('r', '</w>'): 2,
             ('n', 'e'): 6,
             ('e', 'w'): 6,
             ('e', 's'): 9,
             ('s', 't'): 9,
             ('t', '</w>'): 9,
             ('w', 'i'): 3,
             ('i', 'd'): 3,
             ('d', 'e'): 3})

In [5]:
best = max(pairs, key=pairs.get)
best = " ".join(best)

best

'e s'

In this iteration we have selected 'e s' as the best pair. Now let us rebuild our vocabulary with the two symbols merged into one. This is the merge operation.

In [6]:
def merge(best, vocab_in):
    new_vocab = defaultdict(int)

    for token,freq in vocab_in.items():
        if best in token:
            best_concated = best.replace(" ","")
            token = token.replace(best, best_concated)
        new_vocab[token] = freq

    return new_vocab

vocab = merge(best, vocab)
print(vocab)

defaultdict(<class 'int'>, {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3})
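
One caveat with a plain substring replace is that it can match across symbol boundaries once multi-character symbols exist. The sketch below shows a boundary-aware variant in the spirit of the reference code accompanying {cite}`sennrich2016neural`. The function name `merge_pair` and the toy vocabulary are assumptions made for this illustration.

```python
import re

def merge_pair(pair, vocab_in):
    """Merge `pair` (a tuple of two symbols) only where both are whole symbols."""
    bigram = re.escape(" ".join(pair))
    # (?<!\S) and (?!\S) require whitespace or a string edge around the match.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, token): freq
            for token, freq in vocab_in.items()}

# 'he' is already a merged symbol here, so 'e s' must not match inside 'he s'.
toy_vocab = {"n e w e s t </w>": 6, "he s </w>": 2}
safe_result = merge_pair(("e", "s"), toy_vocab)
naive_result = "he s </w>".replace("e s", "es")

print(safe_result)
print(naive_result)
```

The naive replace glues 'he' and 's' together even though the pair being merged is ('e', 's'), while the boundary-aware version leaves that token alone. On the four-word example used in this section both approaches give the same result.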


As you can see from the above output, 'e' and 's' are now merged as 'es'. This is the output of the first iteration. Similarly, we can run multiple iterations. In the example below, we run 5 iterations. In each one we compute the pair frequencies, pick the best pair (the one with the highest frequency), perform the merge operation and store the merge as a rule.

In [7]:
vocab  = {'l o w </w>' : 5, 'l o w e r </w>' : 2, 'n e w e s t </w>':6, 'w i d e s t </w>':3}
rules = []
rule_number = 1
for i in range(5):
    pairs = get_freq_pairs()
    best = max(pairs, key=pairs.get)
    rules.append((rule_number," ".join(best), "".join(best)))
    rule_number+=1
    best = " ".join(best)
    vocab = merge(best, vocab)

print(vocab)
print(rules)

defaultdict(<class 'int'>, {'low </w>': 5, 'low e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3})
[(1, 'e s', 'es'), (2, 'es t', 'est'), (3, 'est </w>', 'est</w>'), (4, 'l o', 'lo'), (5, 'lo w', 'low')]


With the updated frequency and rules to merge, we can finally create our subword dictionary.

In [10]:
token_id = 0
final_vocab = {}
for token,freq in vocab.items():
    symbols = token.split()
    for symbol in symbols:
        if symbol not in final_vocab.keys():
            token_id+=1
            final_vocab[symbol] = token_id

print(final_vocab)

{'low': 1, '</w>': 2, 'e': 3, 'r': 4, 'n': 5, 'w': 6, 'est</w>': 7, 'i': 8, 'd': 9}


Our final vocabulary is ready. You can compare it with our pre-tokenization frequency table, which held whole words such as lower, newest and widest. We now have a compact subword vocabulary. Let us use this vocabulary and our merge rules to tokenize a given text.

In [11]:
def encode_token(test):
    for rule in rules:
        r_no = rule[0]
        pattern = rule[1]
        replacement = rule[2]
        test = test.replace(pattern, replacement)
        
        print(f"Rule no {r_no} pattern {pattern} replacement {replacement} result {test}")
    
    encoded = [final_vocab[item] for item in test.split(" ")]
    print(encoded)

test = 'l o w e r </w>'
encode_token(test)

Rule no 1 pattern e s replacement es result l o w e r </w>
Rule no 2 pattern es t replacement est result l o w e r </w>
Rule no 3 pattern est </w> replacement est</w> result l o w e r </w>
Rule no 4 pattern l o replacement lo result lo w e r </w>
Rule no 5 pattern lo w replacement low result low e r </w>
[1, 3, 4, 2]


Rules 1, 2 and 3 do not apply to our example. Rule 4 merges the characters l and o into lo. Rule 5 then merges lo and w into low. Finally we look up low, e, r and </w> in our vocabulary and encode the text. The single word lower is thus encoded into four integer tokens. Let us see another example.

In [12]:
test = 'l o w e s t </w>'
encode_token(test)

Rule no 1 pattern e s replacement es result l o w es t </w>
Rule no 2 pattern es t replacement est result l o w est </w>
Rule no 3 pattern est </w> replacement est</w> result l o w est</w>
Rule no 4 pattern l o replacement lo result lo w est</w>
Rule no 5 pattern lo w replacement low result low est</w>
[1, 7]


We see that the word lowest is encoded as two tokens. OpenAI has released an implementation of byte-pair encoding at https://github.com/openai/tiktoken.

Google has released SentencePiece (https://github.com/google/sentencepiece), a tokenizer that supports both the byte-pair and Unigram algorithms. During pre-tokenization we typically use whitespace to split the raw text, but there are languages in which words cannot be split by whitespace. SentencePiece is designed to handle those cases: it is a language-agnostic subword tokenizer.
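The encode_token function above applies each rule with a plain string replace, which is fine for our toy words but can, in principle, match across the boundary of a longer symbol. As a minimal sketch, assuming the five merge rules and the vocabulary printed above, the same merges can be applied symbol by symbol. The helper names `apply_merges` and `encode_word` are hypothetical and not part of the chapter's code.

```python
# Merge rules and vocabulary copied from the outputs printed above.
merge_rules = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
final_vocab = {'low': 1, '</w>': 2, 'e': 3, 'r': 4, 'n': 5,
               'w': 6, 'est</w>': 7, 'i': 8, 'd': 9}


def apply_merges(word, rules):
    """Split a raw word into characters plus </w>, then apply each merge rule in order."""
    symbols = list(word) + ["</w>"]
    for left, right in rules:
        merged, i = [], 0
        while i < len(symbols):
            # Merge only when two adjacent symbols match the rule exactly.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def encode_word(word, rules, vocab):
    """Map a raw word to token ids; raises KeyError for symbols missing from vocab."""
    return [vocab[symbol] for symbol in apply_merges(word, rules)]


print(apply_merges("lowest", merge_rules))
print(encode_word("lowest", merge_rules, final_vocab))
print(encode_word("lower", merge_rules, final_vocab))
```

This sketch takes the raw word directly, so the caller does not need to insert spaces or the end-of-word marker by hand.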


### Post Processing

Let us look at a simple example to illustrate the need for post-processing.

In [15]:
test = "speed"
try:
    encode_token(test)
except KeyError:
    print(f"Given word not available in the dictionary")

Rule no 1 pattern e s replacement es result speed
Rule no 2 pattern es t replacement est result speed
Rule no 3 pattern est </w> replacement est</w> result speed
Rule no 4 pattern l o replacement lo result speed
Rule no 5 pattern lo w replacement low result speed
Given word not available in the dictionary


Here the encoding failed because the algorithm did not know how to split this word into subwords: we never encountered the word while building the dictionary. To handle such cases, a special token is added to the vocabulary. Let us rewrite our encode function accordingly.

In [16]:
final_vocab['UNKN'] = 999

def encode_token_v1(test):
    for rule in rules:
        r_no = rule[0]
        pattern = rule[1]
        replacement = rule[2]
        test = test.replace(pattern, replacement)
        
        print(f"Rule no {r_no} pattern {pattern} replacement {replacement} result {test}")
    
    encoded =[]
    for item in test.split(" "):
        if item not in final_vocab.keys():
            encoded.append(final_vocab['UNKN'])
        else:
            encoded.append(final_vocab[item])
    print(encoded)
    
encode_token_v1(test)

Rule no 1 pattern e s replacement es result speed
Rule no 2 pattern es t replacement est result speed
Rule no 3 pattern est </w> replacement est</w> result speed
Rule no 4 pattern l o replacement lo result speed
Rule no 5 pattern lo w replacement low result speed
[999]




In the above example we used a special token **`<UNKN>`** to handle words that are not in the vocabulary. Other common special tokens include

1. `<BOS>`, beginning of sequence, a token that marks the beginning of a text. This helps the LLM understand where the text content begins.
2. `<EOS>`, end of sequence, a token that marks where the text ends.

LLMs are trained on multiple corpora. These tokens help them identify where a text begins and where it ends.


::::{important}
:::{note}
While choosing the tokenizer algorithm, a key requirement is that no information should be lost when encoding tokens to token ids.
:::
::::
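One way to check this requirement is a round trip: decode the token ids and see whether the original text comes back. The following is a minimal sketch, assuming the toy vocabulary built above with the `UNKN` entry added; the `decode` helper is hypothetical and not part of the chapter's code.

```python
# Toy vocabulary from above, including the unknown-word token.
final_vocab = {'low': 1, '</w>': 2, 'e': 3, 'r': 4, 'n': 5,
               'w': 6, 'est</w>': 7, 'i': 8, 'd': 9, 'UNKN': 999}

# Reverse lookup: token id -> subword string.
id_to_token = {token_id: token for token, token_id in final_vocab.items()}


def decode(token_ids):
    """Concatenate subwords and turn each end-of-word marker into a space."""
    text = "".join(id_to_token[token_id] for token_id in token_ids)
    return text.replace("</w>", " ").strip()


print(decode([1, 7]))        # ids produced for "lowest"
print(decode([1, 3, 4, 2]))  # ids produced for "lower"
print(decode([999]))         # ids produced for "speed"
```

The first two round trips recover the original words. The third does not: once a word is mapped to the unknown token, the original text cannot be reconstructed, which is exactly the information loss the note warns about.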


## HuggingFace Libraries

Now that we understand the data preparation pipeline, let us look at the HuggingFace ecosystem and how we can leverage it to build input data pipelines for training LLMs.

In [24]:
from datasets import load_dataset
from io import BytesIO
from pathlib import Path
from urllib.request import urlopen
from zipfile import ZipFile
import os

current_path = Path(os.getcwd())
parent_path  = str(current_path.parent.parent.absolute())


destination = parent_path + '/data/simplebooks/simplebooks-2-raw/'


def download_simplebooks(destination: str) -> None:
    """
    Download simple books dataset
    
    Args:
        destination: download folder string
    """
    url = "https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip"
    http_response = urlopen(url)
    archive = ZipFile(BytesIO(http_response.read()))
    # Extract into the folder passed by the caller, not a literal "destination" directory.
    archive.extractall(path=destination)
    print("Finished downloading and extracting simplebooks.zip")



def load_simple_books(destination: str) -> dict:
    """
    Load the train, test and validation splits as lists of text rows.

    Args:
        destination: folder holding the raw simplebooks files
    """
    dataset = load_dataset(destination)
    
    train        = dataset['train']['text']
    test         = dataset['test']['text']
    validation   = dataset['validation']['text'] 
    
    return {"train": train, "test": test, "validation": validation}





The function download_simplebooks downloads the raw input and stores it in the destination folder. The function load_simple_books then uses the load_dataset function from the datasets library to load this data into memory.

In [25]:
dataset = load_simple_books(destination)

for key,values in dataset.items():
    print(f"{key} rows: {len(values)}")

train rows: 114696
test rows: 14830
validation rows: 13384


As you can see, our raw data is now stored as a dictionary in memory. Let us now use the AutoTokenizer class from the HuggingFace transformers library to load the subword tokenization algorithm employed by the GPT2 model.

In [93]:
from transformers import AutoTokenizer

# padding, truncation and max_length are call-time options, so they are passed
# when the tokenizer is called in the next cell rather than here.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2"
                                               ,padding_side="left"
                                               ,bos_token="<BOS>"
                                               ,eos_token="<EOS>"
                                               ,pad_token="<PAD>"
                                              )
encode_input = []
for sample in dataset['train'][10:15]:
    # Skip empty rows and wrap each text in the sequence markers.
    if len(sample) > 0:
        sample = "<BOS> " + sample + " <EOS>"
        encode_input.append(sample)

print(encode_input)

# tokenize works on a single string, so we tokenize the first sample.
tokens = gpt2_tokenizer.tokenize(encode_input[0])
tokens[0:10]

['<BOS> The Girl Monkey And The String Of Pearls <EOS>', '<BOS> One day the king went for a long walk in the woods. When he came back to his own garden, he sent for his family to come down to the lake for a swim. <EOS>']


['<BOS>',
 'ĠThe',
 'ĠGirl',
 'ĠMonkey',
 'ĠAnd',
 'ĠThe',
 'ĠString',
 'ĠOf',
 'ĠPear',
 'ls']

In [92]:
gpt2_tokenizer(encode_input
               ,padding='max_length'
               ,truncation=True
               ,max_length=10
               ,return_attention_mask=True
              )

{'input_ids': [[50259, 464, 7430, 26997, 843, 383, 10903, 3226, 11830, 7278], [3198, 1110, 262, 5822, 1816, 329, 257, 890, 2513, 287]], 'attention_mask': [[0, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

### Train a custom dictionary

In [87]:
def data_generator(dataset):
    # Yield one text row at a time so the corpus can be streamed.
    for row in dataset:
        yield row


print(next(data_generator(dataset['train'])))
print(next(data_generator(dataset['validation'])))


More Jataka Tales
The Story Of A Lamb On Wheels


In [48]:
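vocab_size = 52000
generator = data_generator(dataset['train'])
# old_tokenizer is not defined in the cells shown; we assume it is the GPT2 tokenizer loaded earlier.
old_tokenizer = gpt2_tokenizer
tokenizer = old_tokenizer.train_new_from_iterator(generator, vocab_size)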






In [49]:
tokenizer.save_pretrained("../data/simplebooks-tokenizer")

('../data/simplebooks-tokenizer/tokenizer_config.json',
 '../data/simplebooks-tokenizer/special_tokens_map.json',
 '../data/simplebooks-tokenizer/vocab.json',
 '../data/simplebooks-tokenizer/merges.txt',
 '../data/simplebooks-tokenizer/added_tokens.json',
 '../data/simplebooks-tokenizer/tokenizer.json')

In [50]:
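simplebooks_tokenizer = AutoTokenizer.from_pretrained("../data/simplebooks-tokenizer")
# Use the first row of the training split, "More Jataka Tales", as the sample text.
sample = dataset['train'][0]
simplebooks_tokenizer.tokenize(sample)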

['More', 'ĠJataka', 'ĠTales']

In [51]:
simplebooks_tokenizer.encode(sample)

[6483, 28923, 10076]

In [52]:
simplebooks_tokenizer(sample)

{'input_ids': [6483, 28923, 10076], 'attention_mask': [1, 1, 1]}

### Simple Books PyTorch Dataset

Let us put together what we have learned so far.

In [97]:
from datasets import load_dataset


import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
import os
from pathlib import Path

current_path = Path(os.getcwd())
parent_path  = str(current_path.parent.parent.absolute())


def get_tokenizer():
    
    tokenizer_path = parent_path + "/data/simplebooks-tokenizer"
    print(f"Loading tokenizer from {tokenizer_path}")
    simplebooks_tokenizer = AutoTokenizer.from_pretrained(str(tokenizer_path))
    return simplebooks_tokenizer


def download_simplebooks(destination: str) -> None:
    """
    Download simple books dataset
    
    Args:
        destination: download folder string
    """
    url = "https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip"
    http_response = urlopen(url)
    archive = ZipFile(BytesIO(http_response.read()))
    # Extract into the folder passed by the caller, not a literal "destination" directory.
    archive.extractall(path=destination)
    print("Finished downloading and extracting simplebooks.zip")


def load_simple_books(destination: str) -> dict:
    """
    Load the train, test and validation splits as lists of text rows.
    """
    dataset = load_dataset(destination)
    
    train        = dataset['train']['text']
    test         = dataset['test']['text']
    validation   = dataset['validation']['text'] 
    
    return {"train": train, "test": test, "validation": validation}
                                         



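class SimpleBooksDataSet(Dataset):
    """
    Next-token prediction dataset: each item is an (input, target) pair of
    token-id tensors, where the target is the input shifted by one position.
    """
    def __init__(self, corpus, max_length, stride, context='train'):
        """
        Args:
            corpus: list of raw text rows
            max_length: number of tokens per chunk
            stride: step between the starts of consecutive chunks
            context: split name, used only in the log message
        """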
        path = parent_path + "/data/simplebooks-tokenizer"
        self.tokenizer  = AutoTokenizer.from_pretrained(path)
        self.input_ids  = []
        self.target_ids = []
        self.token_ids  = []

        for sample in corpus:
            if len(sample) > 0:
                self.token_ids.extend(self.tokenizer.encode(sample, truncation=True, max_length=max_length))
        
        print(f"Total {context} tokens {len(self.token_ids)}")
        
        # Stop one token early so that every target chunk (shifted by one) has max_length tokens.
        for i in range(0, len(self.token_ids) - max_length, stride):
            input_chunk =  self.token_ids[i:i + max_length]
            target_chunk = self.token_ids[i + 1: i + max_length + 1]
            
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))


    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


    
def stack_collate(data):
    features, target = zip(*data)
    X = torch.stack(features)
    y = torch.stack(target)
    return X, y
    
def get_dataloaders(batch_size=16, num_workers=4):
    """
    Build the training and validation dataloaders for simplebooks.
    """

    
    destination = parent_path + '/data/simplebooks/simplebooks-2-raw/'
    
    print(f"Loading dataset from {destination}")
    
    dataset     = load_simple_books(destination)
    
    training_corpus   = dataset['train']
    validation_corpus = dataset['validation']

    
    train_ds      = SimpleBooksDataSet(training_corpus, max_length=50, stride=50)            
    validation_ds = SimpleBooksDataSet(validation_corpus,max_length=50, stride=50, context='validation')  


    train_dataloader = DataLoader(dataset=train_ds, batch_size=batch_size, collate_fn=stack_collate,
                                  shuffle=True,drop_last=True,num_workers=num_workers)

    validation_dataloader = DataLoader(dataset=validation_ds, batch_size=batch_size, collate_fn=stack_collate,
                                  shuffle=False,drop_last=True,num_workers=num_workers)
    
    return train_dataloader, validation_dataloader

                             


In [98]:
train_dataloader, validation_dataloader = get_dataloaders(batch_size=2, num_workers=0)


for x,y in train_dataloader:
    print(x)
    print(y)
    break

Loading dataset from /home/gopi/Documents/small_llm/llmbook/data/simplebooks/simplebooks-2-raw/
Total train tokens 1676477
Total validation tokens 189785
tensor([[  653,  1055,    12,  2165,  3359,    12,   581,    26,     2,  1640,
           372,   434,   925,  2502,    31,  2577,   372,   434,   925,  2502,
            31,   295,   448,   585,   392,   260,   925,   381,  1594,    14,
          1098,  2012,   271,   260,  8612,   536,   260,   666,   396,  2978,
            14,   935,   357,    12,   351,   341,   552,  2978,    12,   410],
        [ 1633,   271,   421,  3481,   589,    12,   270,   351,   404,   348,
            12,   921,   344,   259,  1392,    12,  1121,   283,   341,  1370,
           260,  1539,    14,  3481, 10928,   432,  3149,   341,   624,   822,
           303,   260,   922,    14,   590,   469,  2388,   466,  1156,  3618,
           351,   433,    12,   270,   559,   561,   922,   618,   270,   594]])
tensor([[ 1055,    12,  2165,  3359,    12,   581,   

## Word embedding

Words are represented in a continuous vector space. The idea is that words that are semantically close to each other should also be close in this vector space, so that we can use standard measures such as Euclidean distance and cosine similarity to compare words. This representation is called a word embedding. It is easy to imagine a two-dimensional space, in which each word is represented by a coordinate.

Look at the following figure


Figure 3 Word Embedding.
(reference:word_embedding)=
```{figure} ../../images/chapter1/word_embedding.drawio.png
---
height: 250px
name: preprocessing
---
Word Embedding example
```

In the figure, the words King and Man are represented in a 2D continuous space. With this representation, we can now compare the two words. Words that are semantically close should be close to each other in this 2D vector space. For illustration we kept the vector dimension at 2; in LLMs it is much larger. The largest GPT-3 model, for example, uses 12288-dimensional embeddings.
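To make the comparison concrete, here is a small sketch that computes cosine similarity and Euclidean distance between 2D word vectors. The coordinates are hypothetical values chosen only for illustration, and the word "banana" is an assumed unrelated word that does not appear in the figure.

```python
import math

# Hypothetical 2D coordinates, chosen only for illustration.
embeddings = {
    "king": (0.9, 0.8),
    "man": (0.8, 0.6),
    "banana": (-0.7, 0.2),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


for other in ("man", "banana"):
    cos = cosine_similarity(embeddings["king"], embeddings[other])
    dist = euclidean_distance(embeddings["king"], embeddings[other])
    print(f"king vs {other}: cosine {cos:.3f}, euclidean {dist:.3f}")
```

With these made-up coordinates, king is both closer to man and more aligned with it than with banana, which is the behaviour we hope learned embeddings will show.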



In [108]:
import torch.nn as nn


encoded_input, encoded_target = next(iter(train_dataloader))
embedding_dim  = 32
vocab_size = 52000

wte = nn.Embedding(vocab_size, embedding_dim)

embedded_input = wte(encoded_input)
# (batch_size, context_window, embedding_dimension)
print(embedded_input.shape)
embedded_input

torch.Size([2, 50, 32])


tensor([[[-2.1568e-01,  1.9251e-01,  1.6439e-01,  ...,  8.1474e-01,
           2.1049e+00, -9.9629e-01],
         [-2.4471e-01, -2.6456e-01, -8.3230e-01,  ...,  5.6894e-01,
           1.2458e-01, -6.7888e-02],
         [ 1.3079e+00, -2.0478e+00,  2.9307e+00,  ...,  2.2039e-01,
          -9.5471e-01, -7.1041e-01],
         ...,
         [ 1.4433e+00,  9.8329e-01, -1.4033e-01,  ...,  2.0992e-02,
          -2.5070e-01, -1.9180e-01],
         [ 2.3441e-01, -1.5142e+00, -1.3133e-01,  ...,  1.2781e+00,
           1.3794e+00, -1.2916e-01],
         [ 4.1121e-04, -3.6518e-01,  1.5661e+00,  ...,  2.3526e+00,
           5.3088e-01, -6.4193e-01]],

        [[-6.3845e-02,  8.5167e-01,  1.4399e-01,  ...,  4.2325e-01,
           4.8302e-01,  1.4194e+00],
         [ 1.5958e+00, -1.0533e+00,  5.0496e-01,  ..., -2.6729e-01,
           6.8486e-01,  2.4605e+00],
         [-9.4344e-01, -4.9438e-01, -1.1369e+00,  ..., -8.5355e-01,
          -8.4444e-01,  1.6266e+00],
         ...,
         [ 2.0965e-01, -4

nn.Embedding from PyTorch is a trainable lookup table. We initialized it with the size of our vocabulary and the desired embedding dimension. Each token in our vocabulary is a row in this lookup table. When we pass token ids to it, we get back the embedding vector for each token id. Each token is now a 32-dimensional vector.

Initially these embeddings are random. As the model trains, the word embeddings are learned. It is a design choice whether to learn them from scratch or to load pre-learned embeddings and either keep them frozen, outside the model's learning, or train them further. During the learning process, distributional semantics is leveraged to place semantically similar words close to each other in the new embedding vector space. According to distributional semantics, words with similar meanings are more likely to occur in similar contexts. By training on a large corpus, we hope to expose our model to enough such contexts.
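The lookup-table idea can be illustrated without PyTorch. The sketch below is only an analogy for what an embedding layer does on the forward pass: it has no gradients and does not train. The tiny sizes and the `embed` helper are assumptions made for illustration.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4  # tiny values, chosen only for illustration

# One randomly initialised row per token id, like a freshly created embedding layer.
table = [[random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]


def embed(batch_of_ids):
    """Replace every token id by its row in the table."""
    return [[table[token_id] for token_id in ids] for ids in batch_of_ids]


batch = [[1, 7, 2], [1, 3, 4]]  # (batch_size=2, context_window=3)
embedded = embed(batch)

# Resulting shape: (batch_size, context_window, embedding_dim)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))

# The same token id always maps to the same vector, wherever it occurs.
print(embedded[0][0] == embedded[1][0])
```

Training changes the numbers stored in the rows; the lookup itself stays the same.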




## Position Embedding


"A woman is nothing without her man"
"A man is nothing without her woman"

These two sentences share the same words, and hence the same word embeddings. Without information about word order, a neural network cannot tell the two sentences apart, yet we know they convey different meanings. The model needs to be aware of the position of each token in the input. This is where position embedding comes into play.

A simple solution is to have an embedding dictionary similar to the word embedding: a dictionary with an entry for each position.


```{note}
The context window defines the maximum size of the input an LLM can ingest, that is, the maximum number of tokens it can take in when generating a response.
```

In [128]:
context_window = 50
embedding_dim = 32

pe = nn.Embedding(context_window, embedding_dim)

input_length = encoded_input.shape[1]
batch_size = encoded_input.shape[0]
print(f"Batch size {batch_size} context_window {input_length}")
# One row of position ids (0 .. input_length - 1) per sample in the batch.
positions = torch.arange(input_length).repeat(batch_size, 1)
print(positions.shape)
positions

Batch size 2 context_window 50
torch.Size([2, 50])


tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
        [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])

In [129]:
position_embedding = pe(positions)
print(position_embedding.shape)
position_embedding

torch.Size([2, 50, 32])


tensor([[[-0.0659,  0.3402,  0.4823,  ...,  0.0778, -1.2551,  0.1716],
         [-1.8011,  0.0123, -0.1828,  ..., -0.1942, -0.7024,  0.6973],
         [ 0.2235, -1.4827,  0.3727,  ...,  0.9452,  2.1436,  0.4244],
         ...,
         [-0.1704,  0.2706, -0.5948,  ..., -1.7537, -0.5171,  1.7966],
         [ 0.8878,  1.1431, -0.0761,  ..., -0.0164, -0.9832, -0.6537],
         [ 0.0917, -0.4515,  0.7106,  ..., -0.0086, -1.1809,  0.0525]],

        [[-0.0659,  0.3402,  0.4823,  ...,  0.0778, -1.2551,  0.1716],
         [-1.8011,  0.0123, -0.1828,  ..., -0.1942, -0.7024,  0.6973],
         [ 0.2235, -1.4827,  0.3727,  ...,  0.9452,  2.1436,  0.4244],
         ...,
         [-0.1704,  0.2706, -0.5948,  ..., -1.7537, -0.5171,  1.7966],
         [ 0.8878,  1.1431, -0.0761,  ..., -0.0164, -0.9832, -0.6537],
         [ 0.0917, -0.4515,  0.7106,  ..., -0.0086, -1.1809,  0.0525]]],
       grad_fn=<EmbeddingBackward0>)

Finally we add the word token embedding and position embedding to feed as input to the transformer.

In [127]:
input_to_transformer = embedded_input + position_embedding
input_to_transformer.shape

torch.Size([2, 50, 32])

We encoded the absolute position of each token, hence the name absolute positional embedding. One drawback is that each position's embedding is learned independently of the others. From the model's perspective, nothing tells it how far position 500 is from position 2. Ideally, a position embedding should reflect distance monotonically: the closer two tokens are to each other, the more they should influence each other.


### Relative Position embedding

Relative position embedding leverages the distance between pairs of tokens. These techniques alter the attention mechanism, which we cover in the next chapter.
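The quantity these techniques start from is the signed offset between every pair of positions. The sketch below only builds that offset matrix; how the offsets are turned into learned values and injected into attention differs between methods and is not shown here. The helper name `relative_offsets` is hypothetical.

```python
def relative_offsets(seq_len):
    """Entry [i][j] is j - i: how far token j sits from token i."""
    return [[j - i for j in range(seq_len)] for i in range(seq_len)]


offsets = relative_offsets(5)
for row in offsets:
    print(row)
```

Unlike absolute positions, the offset depends only on the distance between two tokens: the pair at positions (0, 3) and the pair at positions (1, 4) both get offset 3, so the model can treat them the same way.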

## Pre-training data engineering

```{figure} ../../images/chapter1/data_pipeline.png
---
height: 350px
name: Data Engineering
---
Data Engineering pipeline
```

The typical approach is concat-and-chunk, which converts text datasets with variable document lengths into sequences of a fixed target length. First we randomly shuffle and concatenate all tokenized documents. Consecutive documents are separated by a special token such as `<EOT>`, allowing the model to discover document boundaries. We then chunk the concatenated sequence into subsequences of the target sequence length, for example 2048 tokens for Llama 1 and 4096 for Llama 2.
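The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: documents are already tokenized into lists of ids, `EOT_ID` is a hypothetical id for the separator token, and the incomplete tail after the last full chunk is simply dropped, which is one possible choice rather than the only one.

```python
import random

EOT_ID = 0  # hypothetical id of the separator token


def concat_and_chunk(tokenized_docs, seq_len, seed=0):
    """Shuffle documents, join them with a separator, and cut fixed-length chunks."""
    docs = list(tokenized_docs)
    random.Random(seed).shuffle(docs)

    # Concatenate all documents into one stream, marking each document boundary.
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOT_ID)

    # Keep only full chunks; the leftover tail is discarded in this sketch.
    n_chunks = len(stream) // seq_len
    return [stream[k * seq_len:(k + 1) * seq_len] for k in range(n_chunks)]


docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34, 35], [41]]
chunks = concat_and_chunk(docs, seq_len=4)
for chunk in chunks:
    print(chunk)
```

Note that a chunk can start or end in the middle of a document and can contain pieces of several documents; the separator token is what lets the model see where one document stops and the next begins.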
    



### Data sources
    
Common data sources include The Pile, RefinedWeb, RedPajama and Dolma.

### Data pipeline

The pipeline has three stages: filtration, deduplication and diversification.

1. Filtration: a classifier labels each document as high quality or low quality. All documents are passed through this classifier and only the high-quality ones are retained.

2. Deduplication: MinHashLSH techniques are used. Deduplication is done both within a document and across documents.

3. Diversification: other curated, tailored datasets are included.
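To give a feel for the deduplication step, here is a minimal MinHash sketch using only the standard library. It estimates how similar two documents are from short signatures; the LSH bucketing that makes this efficient across millions of documents is omitted. The shingle size, the number of hash functions and all helper names are assumptions made for illustration, not the settings of any particular pipeline.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word windows from the lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "One day the king went for a long walk in the woods"
doc_b = "one day the king went for a long walk in the woods"   # same text, different case
doc_c = "One day the king went for a long walk in the forest"  # near duplicate
doc_d = "Physicists try to develop conceptual and mathematical models"

sig = {name: minhash_signature(shingles(doc))
       for name, doc in [("a", doc_a), ("b", doc_b), ("c", doc_c), ("d", doc_d)]}

for other in ("b", "c", "d"):
    print(f"a vs {other}: {estimated_jaccard(sig['a'], sig[other]):.2f}")
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as duplicates and all but one of them dropped.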
    

### Data quality

### Data Bias

### Privacy and ethical considerations
    
### Synthetic data generation through LLM
    

Microsoft's Phi models were trained mostly on synthetic data.

Cosmopedia is a dataset of synthetic textbooks, blog posts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. It has 30 million files and 25 billion tokens.

## Conclusion

## Further Reading