#### **Okay, so here I run my own experiment with a different dataset: I build a simple tokenizer from scratch by myself**

- keep in mind that this simple tokenizer treats each word and each punctuation character as a token... we'll build more advanced tokenizers later on

**we start by loading the dataset; here we're using the `harry-potter book` dataset**

**steps**
1. read/download the text dataset
2. break the text into tokens (separate words and punctuation characters)
    - **here you can choose whether to keep the whitespace as tokens or discard it. The right choice depends on the text dataset you're working with. For example, if you're training a model to write code, it is essential to keep the whitespace because it carries meaning: Python relies on indentation so much that, without it, your code breaks**
3. after converting the raw dataset into tokens, we convert the tokens into a `vocabulary`: a dictionary containing all the unique words/characters in the dataset, sorted alphabetically, each mapped to its token id
4. we then go on to build our simple `encoders` and `decoders`
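
To make the whitespace choice in step 2 concrete, here is a minimal sketch using the same regular expression as the cells below. The `sample` string and the names `no_ws` and `with_ws` are made up for illustration; they are not part of the notebook.

```python
import re

# Same split pattern as the one used in this notebook
PATTERN = r'(--|\.{3}|[.,:;_"\'()&!?$%^*\-=]|\s)'

# Hypothetical two-line code snippet where indentation matters
sample = "if x:\n    return 1"

pieces = re.split(PATTERN, sample)

# Option A: discard whitespace tokens (what this notebook does)
no_ws = [p.strip() for p in pieces if p.strip()]

# Option B: keep whitespace tokens, dropping only the empty strings
# that re.split produces between two adjacent separators
with_ws = [p for p in pieces if p != ""]

print(no_ws)    # ['if', 'x', ':', 'return', '1']
print(with_ws)  # ['if', ' ', 'x', ':', '\n', ' ', ' ', ' ', ' ', 'return', ' ', '1']

# Only option B lets us rebuild the original text exactly
print("".join(with_ws) == sample)  # True
```

With option A the newline and the four-space indent are gone, so the original layout cannot be recovered from the tokens alone.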

##### **Reading dataset**

In [3]:
with open("../data/harry-potter.txt", "r", encoding="utf-8") as f:
    raw_data = f.read()
    
print(len(raw_data))
print(raw_data[:99])


2652657
Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of nu


In [21]:
import re

In [22]:
# testing how our regular expression behaves on our datasets

text = "Harry\tPotter -- the boy who lived.\nHe said: 'Hello!'...."
text2 = """
Harry Potter and the Sorcerer's Stone


CHAPTER ONE

THE BOY WHO LIVED

Mr. and Mrs. Dursley, of nu
"""
pre_text = re.split(r'(--|\.{3}|[.,:;_"\'()&!?$%^*\-=]|\s)', text)
pre_text = [item.strip() for item in pre_text if item.strip()]
print(pre_text) 

['Harry', 'Potter', '--', 'the', 'boy', 'who', 'lived', '.', 'He', 'said', ':', "'", 'Hello', '!', "'", '...', '.']


In [26]:
# preprocessing actual
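# the `|` in a regular expression means `OR`; the `&` inside the `[]` is just a literal `&` character (regular expressions have no `AND` operator)
# the pattern says: split wherever you see `--`, or `\.{3}` (three dots, i.e. `...`),
# or any one of the punctuation characters listed inside the `[]`, or `\s` (any whitespace character: space, tab, newline).
# the outer `()` is a capturing group, so re.split keeps the separators in the result instead of dropping them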
preprocessed_data = re.split(r'(--|\.{3}|[.,:;_"\'()&!?$%^*\-=]|\s)', raw_data) 
# strip whitespace from each item and drop the items that end up empty
preprocessed_data = [item.strip() for item in preprocessed_data if item.strip()]
print(preprocessed_data[:99])

['Harry', 'Potter', 'and', 'the', 'Sorcerer', "'", 's', 'Stone', 'CHAPTER', 'ONE', 'THE', 'BOY', 'WHO', 'LIVED', 'Mr', '.', 'and', 'Mrs', '.', 'Dursley', ',', 'of', 'number', 'four', ',', 'Privet', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you', "'", 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'didn', "'", 't', 'hold', 'with', 'such', 'nonsense', '.', 'Mr', '.', 'Dursley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'Grunnings', ',', 'which', 'made', 'drills', '.', 'He', 'was', 'a', 'big', ',', 'beefy', 'man', 'with', 'hardly', 'any', 'neck']


#### **converting tokens into a vocabulary**

In [50]:
all_words = sorted(set(preprocessed_data))
all_words.extend(["<|endoftext|>", "<|unk|>"])
print(all_words[:99])
print(all_words[-5:])
print(f"vocab len = {len(all_words)}")

['!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '--', '.', '...', '/', '0', '07', '08', '1', '100', '101', '102', '104', '105', '106', '107', '108', '11', '110', '111', '112', '113', '114', '115', '116', '117', '118', '12', '122', '123', '124', '125', '126', '127', '128', '1289', '129', '1296', '13', '130', '131', '132', '133', '134', '135', '136', '137', '138', '14', '140', '141', '142', '143', '144', '145', '146', '1473', '148', '149', '1492', '150', '151', '152', '154', '157', '158', '159', '16', '160', '161', '1612', '162', '163', '1637', '164', '165', '166', '167', '168', '169', '17', '170', '1709', '171', '1722', '173', '176', '177', '178']
['zooming', '}', '�', '<|endoftext|>', '<|unk|>']
vocab len = 17665


In [51]:
# creating vocabs dict
vocabs:dict = {token:token_id for token_id, token in enumerate(all_words)}
vocabs.items()
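
As a small sketch of what the two cells above do, the same steps can be run on a toy token list. The list `toy_tokens` and the names `toy_vocab_list` and `toy_vocab` are made up for illustration. Because the special tokens are appended after sorting, they always receive the two highest ids.

```python
# Hypothetical toy token list standing in for preprocessed_data
toy_tokens = ["the", "boy", "who", "lived", ".", "the", "boy"]

# Unique tokens, sorted by Python's default string order
toy_vocab_list = sorted(set(toy_tokens))

# Special tokens are appended after sorting, so they land at the end
toy_vocab_list.extend(["<|endoftext|>", "<|unk|>"])

# Map each token to its position in the list
toy_vocab = {token: token_id for token_id, token in enumerate(toy_vocab_list)}

print(toy_vocab)
# {'.': 0, 'boy': 1, 'lived': 2, 'the': 3, 'who': 4, '<|endoftext|>': 5, '<|unk|>': 6}
```

This is consistent with the real vocabulary above: with `vocab len = 17665`, `<|endoftext|>` gets id 17663 and `<|unk|>` gets id 17664.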




#### **creating encoders and decoders**

**so remember** 
- encoder is: input text -> tokens -> token ids
- decoder is: token ids -> tokens -> input text
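
The two directions can be sketched with a tiny hand-written vocabulary. Everything here (`mini_vocab`, `mini_inverse`, the sample sentence) is assumed for illustration only, and plain `str.split()` stands in for the regex-based splitting used in this notebook.

```python
# Hypothetical hand-written vocabulary: token -> token id
mini_vocab = {"boy": 0, "lived": 1, "the": 2, "who": 3, "<|unk|>": 4}

# Reverse mapping for decoding: token id -> token
mini_inverse = {token_id: token for token, token_id in mini_vocab.items()}

# Encoder: input text -> tokens -> token ids
tokens = "the girl who lived".split()
ids = [mini_vocab.get(token, mini_vocab["<|unk|>"]) for token in tokens]
print(ids)  # [2, 4, 3, 1]

# Decoder: token ids -> tokens -> text
decoded = " ".join(mini_inverse[token_id] for token_id in ids)
print(decoded)  # the <|unk|> who lived
```

Note that the round trip is not lossless: "girl" is not in the vocabulary, so it comes back as `<|unk|>`.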


To make our life easy, we wrap both of these in a Python class

In [67]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.token_to_token_id = vocab
        # the vocab has the format {token: token_id}; decoding needs the reverse lookup, so we build {token_id: token} here
        self.token_id_to_token = {token_id:token for token,token_id in vocab.items()}
        
    def encode(self, text):
        preprocess_text = re.split(r'(--|\.{3}|[.,:;_"\'()&!?$%^*\-=]|\s)', text)
        preprocess_text = [item.strip() for item in preprocess_text if item.strip()]
        
        # for each item in the preprocessed text: keep it if it is in the vocab, otherwise replace it with the "<|unk|>" token
        preprocess_text = [
            item if item in self.token_to_token_id else "<|unk|>" for item in preprocess_text 
        ]
        
        # token_to_token_id is a dict whose keys are tokens, so looking up a token returns its token id
        token_ids = [self.token_to_token_id[token] for token in preprocess_text]
        return token_ids
    
    def decode(self, ids):
        text = " ".join([self.token_id_to_token[token_id] for token_id in ids])
        # remove the spaces before the specified punctuation marks
        # e.g. "John is a boy ." -> "John is a boy."
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
        

In [68]:
tokenizer = SimpleTokenizer(vocabs)
text = "Harry\tPotter's -- the boy who lived.\nHe said: 'Hello!'...."
encoded_text = tokenizer.encode(text)
decoded_text = tokenizer.decode(encoded_text)
print(encoded_text)
print(decoded_text)

[1929, 3101, 5, 13894, 11, 16004, 5751, 17340, 10925, 12, 1940, 13919, 285, 5, 1962, 0, 5, 13, 12]
Harry Potter' s -- the boy who lived. He said :' Hello!'....
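
The decoded output above still contains `said :'` because `:` is not in the character class used by `decode`, while the opening quote that follows it is. A minimal sketch of that clean-up step in isolation, where the `joined` string and the extended pattern are assumptions added for illustration:

```python
import re

# Hypothetical tokens after " ".join(...), before any clean-up
joined = "John is a boy . He said : ' Hello ! '"

# Same pattern as in decode(): remove whitespace before these marks
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(cleaned)  # John is a boy. He said :' Hello!'

# Assumed variant: also treat ':' and ';' as marks that attach to the left
cleaned_ext = re.sub(r'\s+([,.:;?!"()\'])', r'\1', joined)
print(cleaned_ext)  # John is a boy. He said:' Hello!'
```

Neither version can tell an opening quote from a closing one, so the space that belongs before an opening quote is removed and the one after it stays. This is one reason the simple decoder cannot reproduce the input exactly.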


In [69]:
text2 = """
"This might help, look -- a manticore savaged someone in 1296, and they
let the manticore off -- oh -- no, that was only because everyone was
too scared to go near it."
"""
print(tokenizer.encode(text2))
print(tokenizer.decode(tokenizer.encode(text2)))

[1, 3955, 11363, 9766, 9, 10975, 11, 4517, 11161, 13976, 14892, 10158, 47, 9, 4824, 16021, 10815, 16004, 11161, 11937, 11, 11957, 11, 11798, 9, 16001, 17155, 11979, 5316, 8229, 17155, 16245, 14015, 16205, 9304, 11697, 10423, 12, 1]
" This might help, look -- a manticore savaged someone in 1296, and they let the manticore off -- oh -- no, that was only because everyone was too scared to go near it."


In [72]:
full_text = " <|endoftext|> ".join([text, text2, "man... this guy Harry Potter is good!"])
print(tokenizer.encode(full_text))
print(tokenizer.decode(tokenizer.encode(full_text)))

[1929, 3101, 5, 13894, 11, 16004, 5751, 17340, 10925, 12, 1940, 13919, 285, 5, 1962, 0, 5, 13, 12, 17663, 1, 3955, 11363, 9766, 9, 10975, 11, 4517, 11161, 13976, 14892, 10158, 47, 9, 4824, 16021, 10815, 16004, 11161, 11937, 11, 11957, 11, 11798, 9, 16001, 17155, 11979, 5316, 8229, 17155, 16245, 14015, 16205, 9304, 11697, 10423, 12, 1, 17663, 11133, 13, 16053, 17664, 1929, 3101, 10414, 9335, 0]
Harry Potter' s -- the boy who lived. He said :' Hello!'.... <|endoftext|>" This might help, look -- a manticore savaged someone in 1296, and they let the manticore off -- oh -- no, that was only because everyone was too scared to go near it." <|endoftext|> man... this <|unk|> Harry Potter is good!
