#### Step 1 : Data Preparation

In [85]:
with open('the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

print("Total Characters in file :", len(raw_text))
print(raw_text[:99])

Total Characters in file : 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


The goal is to split all 20,479 characters of the text into tokens, assign a token ID to each token, and then convert the token IDs into vector embeddings.

First, test the word-splitting scheme on a short sample sentence.

In [86]:
import re 
test_text = "Hello, world, Is this--  a test?" 
result = re.split(r'([,.:;?_!"()\']|--|\s)', test_text)
print(result)

['Hello', ',', '', ' ', 'world', ',', '', ' ', 'Is', ' ', 'this', '--', '', ' ', '', ' ', 'a', ' ', 'test', '?', '']
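
The pattern above wraps all of its alternatives in one capturing group, which is why the punctuation and whitespace show up in the result instead of being dropped. A minimal sketch of the difference, reusing the same sample sentence (the names `sample`, `with_group` and `without_group` are only for illustration):

```python
import re

sample = "Hello, world, Is this--  a test?"

# With a capturing group, re.split keeps the delimiters in the result.
with_group = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Without the group, the delimiters are discarded.
without_group = re.split(r'[,.:;?_!"()\']|--|\s', sample)

# Every character is kept in the first case, so joining restores the input.
assert "".join(with_group) == sample

print(with_group)
print(without_group)
```

Keeping the delimiters (including the empty strings and spaces) is what later allows a plain `"".join(...)` to reproduce the original text exactly.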


Apply the same splitting (pre-processing) scheme to the entire short story:

In [87]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
print(preprocessed[:25])
print(len(preprocessed))

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '--', 'though', ' ', 'a', ' ', 'good']
9235


#### Step 2 : Creating Token IDs

In [88]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1133
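
The vocabulary is simply the sorted set of unique pieces, and a token ID can be taken as a piece's position in that sorted list. A small sketch on a toy token list (the names `tokens`, `unique_tokens` and `toy_vocab` are hypothetical and not part of the notebook):

```python
# Toy token list with repeated entries, including whitespace tokens.
tokens = ["the", " ", "cat", " ", "saw", " ", "the", " ", "dog"]

# Unique tokens in sorted order; duplicates collapse to one entry.
unique_tokens = sorted(set(tokens))

# Map each token to its position in the sorted list.
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}

print(len(toy_vocab))
print(toy_vocab)
```

Sorting makes the mapping deterministic: the same text always yields the same IDs.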


In [89]:
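# Map each unique token to an integer ID (its position in the sorted list).
# This mapping is required by the Tokenizer class below.
vocab = {token:integer for integer,token in enumerate(all_words)}

# Uncomment to inspect the first entries of the vocabulary
# for i, item in enumerate(vocab.items()):
#     print(i,item)
#     if i >= 45:
#         break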

A Tokenizer class with encode and decode methods:

In [90]:
class Tokenizer : 
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        # Split the text into tokens (words, punctuation and whitespace),
        # then look up the ID of each token in the vocabulary.
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        token_ids = [self.str_to_int[strngs] for strngs in preprocessed]
        return token_ids
    
    def decode(self, token_ids):
        text = "".join([self.int_to_str[ints] for ints in token_ids])
        return text

In [91]:
# Initialize tokenizer class
# Make sure not to overwrite the class name 'Tokenizer'
tokenizer_instance = Tokenizer(vocab)  # Use a different variable name for the instance

test_text = "It was not till three years later that, in the course of a few weeks' idling on the Riviera"


# Use the instance to encode and decode
token_ids_for_test_text = tokenizer_instance.encode(test_text)
print("\n")
print(test_text)
print(token_ids_for_test_text)
print("\n")
print("Token ID back to text : ")
tokenid_to_text = tokenizer_instance.decode(token_ids_for_test_text)
print(tokenid_to_text)



It was not till three years later that, in the course of a few weeks' idling on the Riviera
[59, 2, 1080, 2, 714, 2, 1013, 2, 1007, 2, 1126, 2, 607, 2, 990, 8, 0, 2, 571, 2, 991, 2, 300, 2, 725, 2, 118, 2, 440, 2, 1088, 5, 0, 2, 568, 2, 730, 2, 991, 2, 87]


Token ID back to text : 
It was not till three years later that, in the course of a few weeks' idling on the Riviera


Adding end-of-text and unknown (UNK) tokens to extend the vocabulary for the case where we encounter an unseen word.
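
To see why an unknown token is needed, here is a minimal sketch on a toy vocabulary (the names `PATTERN`, `toy_vocab`, `encode_strict` and `encode_with_unk` are hypothetical). A plain dictionary lookup fails on any token that was not in the training text, while a fallback ID lets encoding continue:

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'

# Toy vocabulary (hypothetical), built from one short phrase.
pieces = re.split(PATTERN, "a good fellow")
toy_vocab = {tok: i for i, tok in enumerate(sorted(set(pieces)))}
toy_vocab["<|unk|>"] = len(toy_vocab)  # reserve the last ID for unknown tokens

def encode_strict(text):
    # Plain dictionary lookup: raises KeyError on an unseen token.
    return [toy_vocab[p] for p in re.split(PATTERN, text)]

def encode_with_unk(text):
    # Fall back to the <|unk|> ID for anything outside the vocabulary.
    unk_id = toy_vocab["<|unk|>"]
    return [toy_vocab.get(p, unk_id) for p in re.split(PATTERN, text)]

try:
    encode_strict("a cheap fellow")
except KeyError as err:
    print("KeyError for unseen token:", err)

print(encode_with_unk("a cheap fellow"))  # "cheap" maps to the <|unk|> ID
```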

In [92]:
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab2 = {token:integer for integer, token in enumerate(all_tokens)}

In [93]:
len(vocab2)

for item in list(vocab2.items())[-5:]:
    print(item)

('younger', 1130)
('your', 1131)
('yourself', 1132)
('<|endoftext|>', 1133)
('<|unk|>', 1134)


Extending the Tokenizer class so that encode maps any token missing from the vocabulary to the UNK token.

In [94]:
class TokenizerV2 :
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Replace any token that is not in the vocabulary with <|unk|>
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" for item in preprocessed]

        token_ids = [self.str_to_int[strngs] for strngs in preprocessed]
        return token_ids
    
    def decode(self, token_ids):
        text = "".join([self.int_to_str[ints] for ints in token_ids])
        return text
        

In [95]:
tokenizerV2_instance = TokenizerV2(vocab2)  # Use a different variable name for the instance

text1 = "Hello, do you like tea ?"
text2 = "In the sunlit terraces of the palace"
text = "<|endoftext|>".join((text1, text2))
print(text)

Hello, do you like tea ?<|endoftext|>In the sunlit terraces of the palace


In [96]:
encoded = tokenizerV2_instance.encode(text)
print(text, encoded)
decoded = tokenizerV2_instance.decode(encoded)
print(decoded)

Hello, do you like tea ?<|endoftext|>In the sunlit terraces of the palace [1134, 8, 0, 2, 358, 2, 1129, 2, 631, 2, 978, 2, 0, 13, 1134, 2, 991, 2, 959, 2, 987, 2, 725, 2, 991, 2, 1134]
<|unk|>, do you like tea ?<|unk|> the sunlit terraces of the <|unk|>
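
Note that the decoded text contains `?<|unk|> the` where `<|endoftext|>In` used to be. The characters `<`, `|` and `>` are not delimiters in the split pattern, so the marker and the following word come out as one piece, and that piece is not in the vocabulary. The sketch below shows this, followed by one possible workaround that splits on the marker first. The workaround is an assumption for illustration, not what the notebook does, and the names `PATTERN`, `SPECIAL` and `pieces_fixed` are hypothetical:

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'
SPECIAL = "<|endoftext|>"
text = "Hello, do you like tea ?<|endoftext|>In the sunlit terraces of the palace"

# The marker is glued to the next word because nothing separates them.
pieces = re.split(PATTERN, text)
print([p for p in pieces if SPECIAL in p])

# Possible workaround (assumption): isolate the marker before the normal split.
pieces_fixed = []
for chunk in re.split(f"({re.escape(SPECIAL)})", text):
    if chunk == SPECIAL:
        pieces_fixed.append(chunk)          # keep the marker as its own token
    else:
        pieces_fixed.extend(re.split(PATTERN, chunk))

print([p for p in pieces_fixed if p.strip()])
```

A simpler alternative is to join the texts with `" <|endoftext|> "` (spaces on both sides), at the cost of adding two whitespace tokens around the marker.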
