# Data Preparation

We use *The Verdict*, a short story by Edith Wharton that contains roughly 20,000 characters. The text file is saved in the resources directory as `verdict.txt`; we read it next to explore its contents.

In [1]:
# importing necessary packages
from pathlib import Path

In [2]:
with open(Path("../resources/verdict.txt"), "r", encoding="utf-8") as f:
    verdict = f.read()

print(f"length of the text {len(verdict)}")
print("\n", verdict[:99])

# We have now confirmed the length of the text (the count includes spaces)
# and printed its first 99 characters.

length of the text 20559

 The Verdict: Edith Wharton: 1908
Exported from Wikisource on October 21, 2024

I HAD always thought


## Simple Tokenisation

Here we tokenise the text into words and special characters. We start with a regular-expression approach and later switch to a more sophisticated one, Byte Pair Encoding, using a Python package.

In [3]:
import re

# simple example: split on whitespace and keep the whitespace as tokens
text = "Hello, world. This, is a test."
result = re.split(r"(\s)", text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


Note that the punctuation is still attached to the words. We want to split it off as well, so that words and special characters become separate tokens.

In [4]:
result = re.split(r"([.,]|\s)", text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
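
The empty strings in the output above are a side effect of how `re.split` treats a capturing group: every delimiter is kept, and wherever two delimiters sit next to each other (or a delimiter ends the string) the zero-length text between them is returned as `''`. A minimal sketch on a shorter input, assuming we only want to see where the empty strings come from:

```python
import re

text = "Hello, world."

# The capturing group ( ... ) makes re.split keep each delimiter in the result.
tokens = re.split(r"([.,]|\s)", text)
print(tokens)

# ',' is followed directly by ' ', so the text between them is empty;
# the trailing '.' leaves an empty string at the end as well.
non_empty = [t for t in tokens if t != ""]
print(non_empty)
```

Filtering on `t != ""` removes only the empty strings and leaves the space tokens in place.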


An important note on tokenisation: here we can split off words and whitespace, but if the model needs to learn the nuances of generating code, discarding spaces or tabs can be detrimental to its performance. The generated code may then not be directly executable and would need manual work to get it into the correct shape.
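
To illustrate the point about whitespace, the sketch below tokenises a two-line Python snippet while keeping runs of whitespace as tokens. The pattern and the sample snippet are assumptions made for this illustration only; they are not the pattern used in the rest of this section.

```python
import re

code = "def f(x):\n    return x"

# Illustrative pattern: capture runs of whitespace and a few punctuation marks.
pattern = r"(\s+|[():,])"

# Drop only the empty strings; whitespace tokens are kept.
tokens = [t for t in re.split(pattern, code) if t != ""]
print(tokens)

# With whitespace preserved, joining the tokens reproduces the source exactly.
print("".join(tokens) == code)

# If whitespace tokens are dropped, the indentation and newline are lost.
stripped = [t for t in tokens if t.strip()]
print(" ".join(stripped) == code)
```

The newline and the four-space indent survive as a single token, which is the information a code-generating model would need.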

In [5]:
text = "Hello, world. Is this-- a test?"
# The r prefix marks a raw string literal: Python leaves backslashes as they are
# instead of treating them as escape characters, so \s reaches the regex engine intact.
regex_logic = r"([,.:;?_!\"()']|--|\s)"
result = re.split(regex_logic, text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'Is', ' ', 'this', '--', '', ' ', 'a', ' ', 'test', '?', '']


In [6]:
# We can further remove the whitespace-only and empty items.
# Note that str.strip() removes leading and trailing whitespace from a string;
# items that end up empty are falsy, so the `if` clause filters them out.
print([item for item in result if item.strip()])

new_result = [item.strip() for item in result if item.strip()]
print(f"new result --> {new_result}")

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
new result --> ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
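
The split-then-filter steps above can be wrapped in a small helper so that the same logic is reused on any input. This is a minimal sketch: `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names introduced here, and the pattern is the one already used in this section.

```python
import re

# Same pattern as regex_logic above; the constant name is hypothetical.
TOKEN_PATTERN = r"([,.:;?_!\"()']|--|\s)"


def simple_tokenize(text):
    """Split text on punctuation and whitespace, dropping whitespace and empty items."""
    pieces = re.split(TOKEN_PATTERN, text)
    return [piece.strip() for piece in pieces if piece.strip()]


sample_tokens = simple_tokenize("Hello, world. Is this-- a test?")
print(sample_tokens)
```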


Now let's apply this to the verdict text.

In [7]:
preprocessed_text = re.split(regex_logic, verdict)
preprocessed_text = [text.strip() for text in preprocessed_text if text.strip()]
print(len(preprocessed_text))

4705


In [8]:
print(preprocessed_text[:30])

['The', 'Verdict', ':', 'Edith', 'Wharton', ':', '1908', 'Exported', 'from', 'Wikisource', 'on', 'October', '21', ',', '2024', 'I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow']
