**Building a Crime GPT LLM by Ravi Kumar Sinha**

This code is based on *Build a Large Language Model (From Scratch)*, [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)

<br>
<br>
<br>
<br>

# 2) Trying to input the Crime Novel GPT Data

Packages used in this notebook:
We use several libraries here, including PyTorch, tiktoken, pandas, and the Hugging Face `transformers` and `datasets` packages.

In [1]:
!pip install transformers -q
!pip install datasets -q




In [2]:
from importlib.metadata import version
import pandas as pd
import transformers
from datasets import load_dataset

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))
print(f"Transformers version: {transformers.__version__}")


torch version: 2.2.1+cu121
tiktoken version: 0.7.0
Transformers version: 4.42.3


- This notebook provides a brief overview of the data preparation and sampling procedures to get input data "ready" for an LLM
- Understanding what the input data looks like is a great first step towards understanding how LLMs work

<img src="./figures/01.png" width="700px">

Let's download the dataset.
Here I am going to build a mystery-crime-book GPT with Llama 3.2.

In [3]:
data = load_dataset('AlekseyKorshuk/mystery-crime-books')

In [4]:
data['train']['url']

['https://www.bookrix.com/_ebook-sir-arthur-conan-doyle-the-adventures-of-sherlock-holmes/',
 'https://www.bookrix.com/_ebook-bob-moats-classmate-murders/',
 'https://www.bookrix.com/_ebook-david-burgess-the-samsara-project/',
 'https://www.bookrix.com/_ebook-sir-arthur-conan-doyle-the-hound-of-the-baskervilles-1/',
 'https://www.bookrix.com/_ebook-sarah-j-maas-throne-of-glass/',
 'https://www.bookrix.com/_ebook-edited-by-julian-hawthorne-library-of-the-world-039-s-best-mystery-and-detective-stories/',
 'https://www.bookrix.com/_ebook-robert-f-clifton-the-house-on-timber-lane/',
 'https://www.bookrix.com/_ebook-amardeep-kaur-randhawa-the-girl-next-door/',
 'https://www.bookrix.com/_ebook-agatha-christie-murder-on-the-orient-express/',
 'https://www.bookrix.com/_ebook-sir-arthur-conan-doyle-the-return-of-sherlock-holmes/',
 'https://www.bookrix.com/_ebook-robert-f-clifton-garwood-village/',
 'https://www.bookrix.com/_ebook-kelly-abell-haunted-destiny/',
 'https://www.bookrix.com/_ebook-

In [5]:
# data['train']['text'][0] is the full text of the first book as a single string,
# so write it in one call instead of looping over its characters
with open('Nov_22_output_sherlock_holmes_first_book.txt', 'w', encoding='utf-8') as f:
    f.write(data['train']['text'][0])

In [6]:
pwd

'/teamspace/studios/this_studio'

In [7]:
data = load_dataset('AlekseyKorshuk/mystery-crime-books')
one_book = data['train']['text'][0]
print(one_book[:1000])

Sir Arthur Conan Doyle The Adventures of Sherlock Holmes 
 
 
   
   
 I. A Scandal in Bohemia 
 II. The Red-headed League 
  III. A Case of Identity 
   IV. The Boscombe Valley Mystery 
   V. The Five Orange Pips 
   VI. The Man with the Twisted Lip 
   VII. The Adventure of the Blue Carbuncle  
   VIII. The Adventure of the Speckled Band 
   IX. The Adventure of the Engineer's Thumb 
  X. The Adventure of the Noble Bachelor 
   XI. The Adventure of the Beryl Coronet 
   XII. The Adventure of the Copper Beeches 
 
ADVENTURE I.   A SCANDAL IN BOHEMIA 
 
To Sherlock Holmes she is always THE woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a 

- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above

In [8]:
# Save the text to a .txt file
with open('The_adventure_of_sherlock_holmes.txt', 'w', encoding='utf-8') as file:
    file.write(one_book)

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters

<img src="figures/02.png" width="600px">

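- Load the raw text we want to work with
- The original notebook uses [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict), a public domain short story; here we work with the Sherlock Holmes text stored in `one_book` instead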

In [10]:
!ls

LLM-workshop-2024			      crime_gpt		  test.json
Nov_22_output_sherlock_holmes_first_book.txt  gpt2		  train.json
The_adventure_of_sherlock_holmes.txt	      loss-plot.pdf
checkpoints				      sherlock_model.pth


In [11]:
one_book



In [12]:
print("Total number of characters:", len(one_book))
print(one_book[:99])

Total number of characters: 562489
Sir Arthur Conan Doyle The Adventures of Sherlock Holmes 
 
 
   
   
 I. A Scandal in Bohemia 
 II


<img src="figures/03.png" width="600px">

- The following regular expression will split on whitespaces and punctuation

In [13]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', one_book)
#preprocessed = [item for item in preprocessed if item]
print(preprocessed[:38])

['Sir', ' ', 'Arthur', ' ', 'Conan', ' ', 'Doyle', ' ', 'The', ' ', 'Adventures', ' ', 'of', ' ', 'Sherlock', ' ', 'Holmes', ' ', '', '\n', '', ' ', '', '\n', '', ' ', '', '\n', '', ' ', '', ' ', '', ' ', '', '\n', '', ' ']
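
The output above still contains whitespace entries and empty strings, because the filtering line is commented out. As a minimal sketch of what the split and a follow-up filter do, here is the same regular expression applied to a short made-up sentence (the names `sample`, `raw_split`, and `sample_tokens` are only for illustration):

```python
import re

# Made-up sample sentence, not taken from the dataset
sample = "Hello, world. Is this-- a test?"

# Same pattern as above: the capturing group keeps the delimiters in the result
raw_split = re.split(r'([,.:;?_!"()\']|--|\s)', sample)
print(raw_split)

# Drop empty strings and whitespace-only entries
sample_tokens = [item.strip() for item in raw_split if item.strip()]
print(sample_tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Whether to keep whitespace as tokens is a design choice: dropping it shortens the sequences, while keeping it preserves the exact layout of the text.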


In [14]:
preprocessed

['Sir',
 ' ',
 'Arthur',
 ' ',
 'Conan',
 ' ',
 'Doyle',
 ' ',
 'The',
 ' ',
 'Adventures',
 ' ',
 'of',
 ' ',
 'Sherlock',
 ' ',
 'Holmes',
 ' ',
 '',
 '\n',
 '',
 ' ',
 '',
 '\n',
 '',
 ' ',
 '',
 '\n',
 '',
 ' ',
 '',
 ' ',
 '',
 ' ',
 '',
 '\n',
 '',
 ' ',
 '',
 ' ',
 '',
 ' ',
 '',
 '\n',
 '',
 ' ',
 'I',
 '.',
 '',
 ' ',
 'A',
 ' ',
 'Scandal',
 ' ',
 'in',
 ' ',
 'Bohemia',
 ' ',
 '',
 '\n',
 '',
 ' ',
 'II',
 '.',
 '',
 ' ',
 'The',
 ' ',
 'Red-headed',
 ' ',
 'League',
 ' ',
 '',
 '\n',
 '',
 ' ',
 '',
 ' ',
 'III',
 '.',
 '',
 ' ',
 'A',
 ' ',
 'Case',
 ' ',
 'of',
 ' ',
 'Identity',
 ' ',
 '',
 '\n',
 '',
 ' ',
 '',
 ' ',
 '',
 ' ',
 'IV',
 '.',
 '',
 ' ',
 'The',
 ' ',
 'Boscombe',
 ' ',
 'Valley',
 ' ',
 'Mystery',
 ' ',
 '',
 '\n',
 '',
 ' ',
 '',
 ' ',
 '',
 ' ',
 'V',
 '.',
 '',
 ' ',
 'The',
 ' ',
 'Five',
 ' ',
 'Orange',
 ' ',
 'Pips',
 ' ',
 '',
 '\n',
 '',
 ' ',
 '',
 ' ',
 '',
 ' ',
 'VI',
 '.',
 '',
 ' ',
 'The',
 ' ',
 'Man',
 ' ',
 'with',
 ' ',
 'the',
 ' ',
 

In [15]:
print("Number of tokens:", len(preprocessed))

Number of tokens: 258835


Number of unique tokens in this dataset (still including whitespace and the empty string, since the filter above is commented out): 8811

In [15]:
len(set(preprocessed))

8811

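After removing the entries that sort before `'A'` (the empty string, whitespace, and punctuation; tokens that start with a digit are dropped as well), we get:

 Punctuation-removed vocabulary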

In [16]:
punctuations_removed_data = sorted(set(preprocessed))[98:8811]
punctuations_removed_data

['A',
 'ADLER',
 'ADVENTURE',
 'ARAT',
 'Abbots',
 'Aberdeen',
 'About',
 'Above',
 'Absolute',
 'Absolutely',
 'Accustomed',
 'Across',
 'Adler',
 'Adventure',
 'Adventures',
 'Affairs',
 'Afghan',
 'Afghanistan',
 'After',
 'Again',
 'Agra',
 'Ah',
 'Air',
 'Alas',
 'Albert',
 'Aldersgate',
 'Aldershot',
 'Alexander',
 'Alice',
 'Alicia',
 'All',
 'Allegro',
 'Aloysius',
 'Alpha',
 'Already',
 'Also',
 'Altogether',
 'Always',
 'Amateur',
 'America',
 'American',
 'Americans',
 'Amid',
 'Among',
 'Amoy',
 'Ample',
 'An',
 'And',
 'Anderson',
 'Andover',
 'Angel',
 'Another',
 'Anstruther',
 'Any',
 'Anybody',
 'Anyhow',
 'Anything',
 'Apache',
 'Apaches',
 'Apply',
 'April',
 'Arabian',
 'Archery',
 'Archie',
 'Architecture',
 'Are',
 'Arizona',
 'Armitage',
 'Armour',
 'Arms',
 'Arnsworth',
 'Arthur',
 'Artillery',
 'As',
 'Assizes',
 'Astonishment',
 'At',
 'Atkinson',
 'Atlantic',
 'Attica',
 'Auckland',
 'Augustine',
 'Australia',
 'Australian',
 'Australians',
 'Avenue',
 'Awake
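
The slice `[98:8811]` relies on the position of `'A'` in the sorted vocabulary of this particular book. A sketch of a filter that does not hard-code positions, assuming the goal is simply to keep tokens that start with a letter, could look like this (the sample sentence and the names `sample_text`, `sample_split`, and `word_tokens` are made up for illustration):

```python
import re

# Made-up sample containing punctuation and a number
sample_text = 'To Sherlock Holmes she is always THE woman. "I have 2 cases," he said.'

sample_split = re.split(r'([,.:;?_!"()\']|--|\s)', sample_text)

# Keep only non-empty tokens whose first character is a letter (assumption:
# this is the intended meaning of "punctuation removed")
word_tokens = sorted({t for t in sample_split if t and t[0].isalpha()})
print(word_tokens)
# ['Holmes', 'I', 'Sherlock', 'THE', 'To', 'always', 'cases', 'have', 'he', 'is', 'said', 'she', 'woman']
```

Note that this is not identical to the slice: the slice also keeps tokens that sort after `'A'` but do not start with a letter.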

<br>
<br>
<br>
<br>

# 2.2 Converting tokens into token IDs

- Next, we convert the text tokens into token IDs that we can process via embedding layers later
- For this we first need to build a vocabulary

<img src="figures/04.png" width="900px">

- The vocabulary contains the unique words in the input text

In [17]:
all_words = sorted(set(punctuations_removed_data))
vocab_size = len(all_words)

print(vocab_size)

8713


In [18]:
for i in enumerate(punctuations_removed_data):
    print(i)

(0, 'A')
(1, 'ADLER')
(2, 'ADVENTURE')
(3, 'ARAT')
(4, 'Abbots')
(5, 'Aberdeen')
(6, 'About')
(7, 'Above')
(8, 'Absolute')
(9, 'Absolutely')
(10, 'Accustomed')
(11, 'Across')
(12, 'Adler')
(13, 'Adventure')
(14, 'Adventures')
(15, 'Affairs')
(16, 'Afghan')
(17, 'Afghanistan')
(18, 'After')
(19, 'Again')
(20, 'Agra')
(21, 'Ah')
(22, 'Air')
(23, 'Alas')
(24, 'Albert')
(25, 'Aldersgate')
(26, 'Aldershot')
(27, 'Alexander')
(28, 'Alice')
(29, 'Alicia')
(30, 'All')
(31, 'Allegro')
(32, 'Aloysius')
(33, 'Alpha')
(34, 'Already')
(35, 'Also')
(36, 'Altogether')
(37, 'Always')
(38, 'Amateur')
(39, 'America')
(40, 'American')
(41, 'Americans')
(42, 'Amid')
(43, 'Among')
(44, 'Amoy')
(45, 'Ample')
(46, 'An')
(47, 'And')
(48, 'Anderson')
(49, 'Andover')
(50, 'Angel')
(51, 'Another')
(52, 'Anstruther')
(53, 'Any')
(54, 'Anybody')
(55, 'Anyhow')
(56, 'Anything')
(57, 'Apache')
(58, 'Apaches')
(59, 'Apply')
(60, 'April')
(61, 'Arabian')
(62, 'Archery')
(63, 'Archie')
(64, 'Architecture')
(65, 'Are')


In [18]:
vocab = {token:integer for integer,token in enumerate(all_words)}

- Below are the first 51 entries (indices 0 through 50) in this vocabulary:

In [22]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('A', 0)
('ADLER', 1)
('ADVENTURE', 2)
('ARAT', 3)
('Abbots', 4)
('Aberdeen', 5)
('About', 6)
('Above', 7)
('Absolute', 8)
('Absolutely', 9)
('Accustomed', 10)
('Across', 11)
('Adler', 12)
('Adventure', 13)
('Adventures', 14)
('Affairs', 15)
('Afghan', 16)
('Afghanistan', 17)
('After', 18)
('Again', 19)
('Agra', 20)
('Ah', 21)
('Air', 22)
('Alas', 23)
('Albert', 24)
('Aldersgate', 25)
('Aldershot', 26)
('Alexander', 27)
('Alice', 28)
('Alicia', 29)
('All', 30)
('Allegro', 31)
('Aloysius', 32)
('Alpha', 33)
('Already', 34)
('Also', 35)
('Altogether', 36)
('Always', 37)
('Amateur', 38)
('America', 39)
('American', 40)
('Americans', 41)
('Amid', 42)
('Among', 43)
('Amoy', 44)
('Ample', 45)
('An', 46)
('And', 47)
('Anderson', 48)
('Andover', 49)
('Angel', 50)
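
To make the mapping concrete, here is a small sketch that builds a vocabulary from a handful of tokens and converts them to IDs and back. The token list and the names `toy_tokens`, `toy_vocab`, and `toy_inverse` are made up for illustration and do not touch the `vocab` built above:

```python
# A few tokens, with one repeat, standing in for the preprocessed text
toy_tokens = ["The", "Adventures", "of", "Sherlock", "Holmes", "of"]

# Unique tokens, sorted, each mapped to its position
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}
print(toy_vocab)
# {'Adventures': 0, 'Holmes': 1, 'Sherlock': 2, 'The': 3, 'of': 4}

# Token -> ID for the whole sequence
toy_ids = [toy_vocab[t] for t in toy_tokens]
print(toy_ids)
# [3, 0, 4, 2, 1, 4]

# The inverse mapping turns IDs back into tokens
toy_inverse = {idx: token for token, idx in toy_vocab.items()}
print([toy_inverse[i] for i in toy_ids])
```

Because uppercase letters sort before lowercase ones, `'The'` gets a smaller ID than `'of'`, which matches the ordering seen in the vocabulary listing above.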


- Below, we illustrate the tokenization of a short sample text using a small vocabulary:

<img src="figures/05.png" width="600px">

- Let's now put it all together into a tokenizer class

In [23]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

Note: compared with the splitting regex used earlier, the `encode` regex omits the double quote (and also `:` and `;`)

In [24]:
tokenizer = SimpleTokenizerV1(vocab)

In [25]:
tokenizer.decode([927, 71, 225, 288, 1011, 14, 5631])

'Sir Arthur Conan Doyle The Adventures of'

- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

<img src="figures/06.png" width="600px">

- We can use the tokenizer to encode (that is, tokenize) texts into integers
- These integers can later be embedded and used as input to the LLM

In [26]:
tokenizer = SimpleTokenizerV1(vocab)

text = """Sir Arthur Conan Doyle The Adventures of Sherlock Holmes"""
ids = tokenizer.encode(text)
print(ids)

[927, 71, 225, 288, 1011, 14, 5631, 913, 495]


Note: `encode` raises a `KeyError` if the text contains a token that is not in the vocabulary, for example a double quote ("), since punctuation was dropped from the vocabulary above
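
One common way to avoid such a `KeyError` is to reserve an ID for unknown tokens. The sketch below is a hypothetical variant of the tokenizer (the class name `SimpleTokenizerUnk` and the `<|unk|>` entry are assumptions, not part of this notebook), and it assumes the vocabulary IDs are contiguous from 0:

```python
import re

class SimpleTokenizerUnk:
    """Hypothetical variant of SimpleTokenizerV1 with an unknown-token fallback."""

    def __init__(self, vocab):
        self.str_to_int = dict(vocab)
        # Assumption: IDs run from 0 to len(vocab) - 1, so len(vocab) is free
        self.str_to_int.setdefault("<|unk|>", len(self.str_to_int))
        self.int_to_str = {i: s for s, i in self.str_to_int.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        unk_id = self.str_to_int["<|unk|>"]
        # dict.get falls back to the unknown ID instead of raising KeyError
        return [self.str_to_int.get(t, unk_id) for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that " ".join puts before punctuation
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

# Tiny made-up vocabulary without any punctuation
unk_vocab = {t: i for i, t in enumerate(sorted({"The", "Adventures", "of", "Sherlock", "Holmes"}))}
unk_tokenizer = SimpleTokenizerUnk(unk_vocab)

unk_ids = unk_tokenizer.encode('The Adventures of "Sherlock" Holmes')
print(unk_ids)                        # [3, 0, 4, 5, 2, 5, 1]
print(unk_tokenizer.decode(unk_ids))  # The Adventures of <|unk|> Sherlock <|unk|> Holmes
```

The trade-off is that unknown tokens are no longer recoverable after decoding, which is one motivation for the subword tokenizer in the next section.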

- We can decode the integers back into text

In [27]:
tokenizer.decode(ids)

'Sir Arthur Conan Doyle The Adventures of Sherlock Holmes'

In [28]:
tokenizer.decode(tokenizer.encode(text))

'Sir Arthur Conan Doyle The Adventures of Sherlock Holmes'

<br>
<br>
<br>
<br>

# 2.3 BytePair encoding

- GPT-2 used BytePair encoding (BPE) as its tokenizer
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words
- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges
- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)
- In this lecture, we are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance
- (Based on an analysis [here](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb), I found that `tiktoken` is approx. 3x faster than the original tokenizer and 6x faster than an equivalent tokenizer in Hugging Face)
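
The merge idea behind BPE can be illustrated with a toy, character-level sketch. This is not GPT-2's byte-level implementation or its learned merge table; the helper names `most_frequent_pair` and `merge_pair` and the word list are made up for illustration:

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Return the most frequent adjacent symbol pair across all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the joined symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from individual characters and apply two merge steps
bpe_words = ["low", "lower", "lowest", "slow"]
bpe_seqs = [list(w) for w in bpe_words]
bpe_merges = []
for _ in range(2):
    pair = most_frequent_pair(bpe_seqs)
    bpe_merges.append(pair)
    bpe_seqs = [merge_pair(s, pair) for s in bpe_seqs]

print(bpe_merges)
print(bpe_seqs)
# [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't'], ['s', 'low']]
```

After two merges the frequent substring "low" has become a single symbol, while rarer endings stay split into characters. A word never seen during merge learning can still be represented from these smaller pieces.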

In [29]:
%pip install tiktoken -q


Note: you may need to restart the kernel to use updated packages.


In [30]:
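import importlib.metadata  # import the submodule explicitly; a bare `import importlib` does not guarantee it is loaded
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))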

tiktoken version: 0.7.0


In [31]:
tokenizer = tiktoken.get_encoding("gpt2")

In [32]:
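# Note: the two adjacent string literals are concatenated without a space,
# so the text contains "terracesof" (visible in the decoded output below)
text = (
    "Sir Arthur Conan Doyle The Adventures of Sherlock Holmes. <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)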

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[22788, 13514, 31634, 31233, 383, 15640, 286, 25730, 17628, 13, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [33]:
strings = tokenizer.decode(integers)

print(strings)

Sir Arthur Conan Doyle The Adventures of Sherlock Holmes. <|endoftext|> In the sunlit terracesof someunknownPlace.


- BPE tokenizers break down unknown words into subwords and individual characters:

<img src="figures/07.png" width="600px">

In [34]:
tokenizer.encode("Akwirw ier", allowed_special={"<|endoftext|>"})

[33901, 86, 343, 86, 220, 959]

<br>
<br>
<br>
<br>

# 2.4 Data sampling with a sliding window

- Above, we took care of the tokenization (converting text into word tokens represented as token ID numbers)
- Now, let's talk about how we set up data loading for LLMs
- We train LLMs to generate one word at a time, so we want to prepare the training data accordingly where the next word in a sequence represents the target to predict

<img src="figures/08.png" width="600px">
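
The next-word prediction setup can be sketched in plain Python: for each position, everything seen so far is the context and the following token ID is the target. The IDs below are the first few GPT-2 token IDs of the book as they appear in the dataloader output of this notebook; the names `sample_ids`, `context_size`, and `pairs` are made up for illustration:

```python
# First few token IDs of the book ("Sir Arthur Conan Doyle The")
sample_ids = [22788, 13514, 31634, 31233, 383]
context_size = 4

pairs = []
for i in range(1, context_size + 1):
    context = sample_ids[:i]   # tokens the model has seen so far
    desired = sample_ids[i]    # the token it should predict next
    pairs.append((context, desired))
    print(context, "---->", desired)
```

Each line printed is one training example: the left side is the input, the right side is the single next token to predict.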

- For this, we use a sliding window approach, changing the position by +1:

<img src="figures/09.png" width="900px">

- Note that in practice it's best to set the stride equal to the context length so that we don't have overlaps between the inputs (the targets are still shifted by +1 always)

<img src="figures/10.png" width="600px">
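
The effect of the stride can be seen in a small pure-Python sketch. The helper `sliding_windows` is hypothetical and only mimics the idea; the internals of `create_dataloader_v1` are not shown in this notebook and may differ:

```python
def sliding_windows(token_ids, max_length, stride):
    """Build (input, target) chunks where the target is the input shifted by one."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])
        targets.append(token_ids[i + 1:i + max_length + 1])
    return inputs, targets

window_ids = list(range(10))  # stand-in for real token IDs

# stride == max_length: consecutive inputs do not overlap
win_inputs, win_targets = sliding_windows(window_ids, max_length=4, stride=4)
print(win_inputs)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(win_targets)  # [[1, 2, 3, 4], [5, 6, 7, 8]]

# stride == 1: every input overlaps the previous one in all but one position
overlap_inputs, _ = sliding_windows(window_ids, max_length=4, stride=1)
print(len(overlap_inputs))  # 6
```

A stride of 1 yields more training examples from the same text but with heavy overlap, while a stride equal to the context length covers the text once without repeating inputs.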

In [35]:
data

DatasetDict({
    train: Dataset({
        features: ['url', 'text'],
        num_rows: 359
    })
})

In [36]:
from supplementary import create_dataloader_v1


dataloader = create_dataloader_v1(one_book, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[22788, 13514, 31634, 31233],
        [  383, 15640,   286, 25730],
        [17628,   220,   198,   220],
        [  198,   220,   198,   220],
        [  220,   220,   198,   220],
        [  220,   220,   198,   314],
        [   13,   317,  1446,  7642],
        [  287, 45560,   544,   220]])

Targets:
 tensor([[13514, 31634, 31233,   383],
        [15640,   286, 25730, 17628],
        [  220,   198,   220,   198],
        [  220,   198,   220,   220],
        [  220,   198,   220,   220],
        [  220,   198,   314,    13],
        [  317,  1446,  7642,   287],
        [45560,   544,   220,   198]])


<br>
<br>
<br>
<br>

# Exercise: Prepare your own favorite text dataset