# Cleaning Data for LLMs

Raw text gathered from a variety of sources is rarely ready for large language models as-is. A series of steps gets the data ready, from cleaning to vectorizing it. We will focus on cleaning the text data first, covering NLTK and spaCy.

## The Legend of Sleepy Hollow

For this example, we are going to use the American short story *The Legend of Sleepy Hollow* by Washington Irving. A plain text version [can be found easily at Project Gutenberg](https://www.gutenberg.org/ebooks/41), but it is included with this notebook for convenience. Let's load the file contents as a string into the `text` variable.

In [3]:
filename = 'legend_of_sleepy_hollow.txt'

# read the whole file into a single string; the context manager closes the file for us
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()

text

'\ufeffThe Project Gutenberg eBook of The Legend of Sleepy Hollow\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: The Legend of Sleepy Hollow\n\n\nAuthor: Washington Irving\n\nRelease date: June 27, 2008 [eBook #41]\n                Most recently updated: June 27, 2022\n\nLanguage: English\n\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***\n\n\n\n\nThe Legend of Sleepy Hollow\n\nby Washington Irving\n\n\n\n\nFOUND AMONG THE PAPERS OF THE LATE DIEDRICH KNICKERBOCKER.\n\n\n        A pleasing land of drowsy head it was,\n          Of dreams that wave
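
The output above begins with `\ufeff`, a Unicode byte order mark (BOM) left at the start of the file. Here is a minimal sketch of one way to avoid it, assuming the file is UTF-8 encoded: Python's `utf-8-sig` codec drops a leading BOM when reading. The temporary file and sample text are made up for illustration and stand in for the real Gutenberg file.

```python
import os
import tempfile

# Hypothetical sample standing in for the Gutenberg file: UTF-8 with a leading BOM.
sample = '\ufeffThe Project Gutenberg eBook of The Legend of Sleepy Hollow\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')
    with open(path, 'wt', encoding='utf-8') as f:
        f.write(sample)

    # Plain utf-8 keeps the BOM as the first character of the string.
    with open(path, 'rt', encoding='utf-8') as f:
        with_bom = f.read()

    # utf-8-sig strips a leading BOM if one is present.
    with open(path, 'rt', encoding='utf-8-sig') as f:
        without_bom = f.read()

print(repr(with_bom[:12]))
print(repr(without_bom[:12]))
```

In this notebook the BOM sits inside the license header, so it disappears anyway once the boilerplate is removed.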

Here we can make some observations about our data. 

* Thankfully, this is pretty clean text, and we do not have to strip out any HTML, PDF markup, or similar formatting residue.
* There is some boilerplate for licensing and other metadata which we may want to remove.
* This book is in English and was not translated from another language.
* We do not anticipate spelling or grammar mistakes.
* There are some interesting hyphenations and historical spellings like "red-tipt" and "yellow-tipt."
* Newline `\n` characters appear frequently because the file is hard-wrapped at roughly 70 characters per line.
* There do not seem to be any numbers, or at least not enough of them that we need to handle them.
* There are names in this document, like Yost Van Houten.

There is a lot more going on here but this is simple enough to get us started. 
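
The hard-wrapped newlines noted in the list above can be folded back into spaces while keeping blank lines as paragraph breaks. This is a rough sketch on a made-up sample; `unwrap` is a hypothetical helper name, not something the notebook defines.

```python
import re

def unwrap(text: str) -> str:
    """Join hard-wrapped lines, keeping blank lines as paragraph breaks."""
    # A newline with no newline directly before or after it is a line wrap.
    joined = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # Collapse runs of spaces left over from indented lines.
    return re.sub(r' {2,}', ' ', joined)

sample = (
    'In the bosom of one of those spacious coves which indent the eastern\n'
    'shore of the Hudson, at that broad expansion of the river.\n'
    '\n'
    'This name was given, we are told, in\n'
    'former days.'
)

unwrapped = unwrap(sample)
print(unwrapped)
```

Verse, such as the poem at the start of the story, would be folded into a single line as well, so treat this as a starting point rather than a finished rule.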

If we open the text file directly in a text editor, we will see license boilerplate before line 27 and after line 1159. Rather than rely on line numbers, it is easier to match the marker lines that end the first boilerplate section and start the second. We can use regular expression patterns for this.

In [5]:
import re 

text = re.sub(r"^(.|\n)+START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW \*{3}", '', text)
text = re.sub(r"\*{3} END OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW (.|\n)+", '', text)
text = text.strip()

text

'The Legend of Sleepy Hollow\n\nby Washington Irving\n\n\n\n\nFOUND AMONG THE PAPERS OF THE LATE DIEDRICH KNICKERBOCKER.\n\n\n        A pleasing land of drowsy head it was,\n          Of dreams that wave before the half-shut eye;\n        And of gay castles in the clouds that pass,\n          Forever flushing round a summer sky.\n                                         CASTLE OF INDOLENCE.\n\n\nIn the bosom of one of those spacious coves which indent the eastern\nshore of the Hudson, at that broad expansion of the river denominated\nby the ancient Dutch navigators the Tappan Zee, and where they always\nprudently shortened sail and implored the protection of St. Nicholas\nwhen they crossed, there lies a small market town or rural port, which\nby some is called Greensburgh, but which is more generally and properly\nknown by the name of Tarry Town. This name was given, we are told, in\nformer days, by the good housewives of the adjacent country, from the\ninveterate propensity of their h
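
The two substitutions above can also be written as a search for the marker lines followed by a slice, which avoids matching the whole boilerplate one character at a time. This is a minimal sketch on a made-up miniature document. `strip_gutenberg_boilerplate` is a hypothetical helper, and it assumes the markers follow the `*** START OF ... ***` and `*** END OF ... ***` form seen in this file.

```python
import re

START_MARKER = re.compile(r'\*{3} START OF THE PROJECT GUTENBERG EBOOK .*? \*{3}')
END_MARKER = re.compile(r'\*{3} END OF THE PROJECT GUTENBERG EBOOK .*? \*{3}')

def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines."""
    start = START_MARKER.search(raw)
    end = END_MARKER.search(raw)
    # If a marker is missing, fall back to the beginning or end of the string.
    begin_index = start.end() if start else 0
    end_index = end.start() if end else len(raw)
    return raw[begin_index:end_index].strip()

sample = (
    'Title: The Legend of Sleepy Hollow\n\n'
    '*** START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***\n\n'
    'The Legend of Sleepy Hollow\n\nby Washington Irving\n\n'
    '*** END OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***\n'
    'License text follows here.\n'
)

body = strip_gutenberg_boilerplate(sample)
print(repr(body))
```

Because the title is matched by `.*?` rather than spelled out, the same helper should carry over to other Project Gutenberg files that use these markers.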

## Manual Tokenization

To prepare this data in any meaningful way, we need to split the text into words, a step known as tokenization. We will first do this from scratch in Python to understand the process before we bring in libraries to help us.

The simplest approach is to split the text on whitespace with `str.split()`. Note that punctuation stays attached to the words, as in `'KNICKERBOCKER.'` and `'was,'`.

In [6]:
text.split()

['The',
 'Legend',
 'of',
 'Sleepy',
 'Hollow',
 'by',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'THE',
 'PAPERS',
 'OF',
 'THE',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER.',
 'A',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was,',
 'Of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half-shut',
 'eye;',
 'And',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass,',
 'Forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky.',
 'CASTLE',
 'OF',
 'INDOLENCE.',
 'In',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'Hudson,',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'Dutch',
 'navigators',
 'the',
 'Tappan',
 'Zee,',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'St.',
 'Nicholas',
 'when',
 'they',
 'crossed,',
 'th

We can also use [regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) to split on more elaborate patterns than whitespace. Here, `\W+` splits on any run of non-word characters, so punctuation is dropped and hyphenated words such as "half-shut" are split into separate tokens.

In [7]:
import re 

words = re.split(r'\W+', text)

words

['The',
 'Legend',
 'of',
 'Sleepy',
 'Hollow',
 'by',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'THE',
 'PAPERS',
 'OF',
 'THE',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER',
 'A',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was',
 'Of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half',
 'shut',
 'eye',
 'And',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass',
 'Forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky',
 'CASTLE',
 'OF',
 'INDOLENCE',
 'In',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'Hudson',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'Dutch',
 'navigators',
 'the',
 'Tappan',
 'Zee',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'St',
 'Nicholas',
 'when',
 'they',
 'crossed',
 'there',
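
If hyphenated words such as "half-shut" should stay whole, one option is to match tokens rather than split on separators. This is a minimal sketch on a made-up sample line; the pattern is an illustrative choice, not the notebook's method.

```python
import re

sample = 'Of dreams that wave before the half-shut eye; a red-tipt arrow.'

# Splitting on non-word runs breaks hyphenated words apart and can leave an
# empty string when the text ends with punctuation.
split_tokens = re.split(r'\W+', sample)

# Matching word characters, optionally joined by single hyphens, keeps
# hyphenated words intact and never yields empty strings.
matched_tokens = re.findall(r'\w+(?:-\w+)*', sample)

print(split_tokens)
print(matched_tokens)
```

Which behavior is preferable depends on the downstream task: keeping "red-tipt" whole preserves the historical spelling, while splitting it yields more common sub-words.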


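Now let's say we want to remove punctuation explicitly, which matters when tokens come from a plain whitespace split (the `\W+` split above has already discarded most of it). We can get a convenient string of punctuation characters from Python's standard library.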

In [8]:
import re 
import string 

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


We can then build a regular expression character class from these punctuation characters and use it to remove them from each word.

In [9]:
regex_punct = re.compile(f'[{re.escape(string.punctuation)}]')
stripped = [regex_punct.sub('', w) for w in words]
stripped

['The',
 'Legend',
 'of',
 'Sleepy',
 'Hollow',
 'by',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'THE',
 'PAPERS',
 'OF',
 'THE',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER',
 'A',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was',
 'Of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half',
 'shut',
 'eye',
 'And',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass',
 'Forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky',
 'CASTLE',
 'OF',
 'INDOLENCE',
 'In',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'Hudson',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'Dutch',
 'navigators',
 'the',
 'Tappan',
 'Zee',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'St',
 'Nicholas',
 'when',
 'they',
 'crossed',
 'there',
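
The same stripping can be done without a regular expression. As a sketch of an alternative, `str.maketrans` builds a table that deletes every punctuation character and `str.translate` applies it. The sample tokens are taken from the whitespace-split output earlier in this section.

```python
import string

# Translation table mapping every ASCII punctuation character to None (delete).
punct_table = str.maketrans('', '', string.punctuation)

sample_words = ['KNICKERBOCKER.', 'was,', 'half-shut', 'eye;', 'St.']
stripped_sample = [w.translate(punct_table) for w in sample_words]

print(stripped_sample)
```

Note that deleting the hyphen turns `half-shut` into `halfshut`, and that `string.punctuation` covers only ASCII characters, so curly quotes would pass through untouched.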


We should probably also make the casing consistent by picking one convention, uppercase or lowercase, and sticking to it.

In [10]:
lowercased = [w.lower() for w in stripped]
lowercased

['the',
 'legend',
 'of',
 'sleepy',
 'hollow',
 'by',
 'washington',
 'irving',
 'found',
 'among',
 'the',
 'papers',
 'of',
 'the',
 'late',
 'diedrich',
 'knickerbocker',
 'a',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was',
 'of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half',
 'shut',
 'eye',
 'and',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass',
 'forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky',
 'castle',
 'of',
 'indolence',
 'in',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'hudson',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'dutch',
 'navigators',
 'the',
 'tappan',
 'zee',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'st',
 'nicholas',
 'when',
 'they',
 'crossed',
 'there',
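
Putting the steps of this section together, a small helper might look like the following. This is a sketch under the same assumptions as the cells above (English text, ASCII punctuation); `clean_tokens` is a hypothetical name.

```python
import re
import string

PUNCT_RE = re.compile(f'[{re.escape(string.punctuation)}]')

def clean_tokens(text: str) -> list[str]:
    """Split on non-word characters, strip punctuation, and lowercase."""
    tokens = re.split(r'\W+', text)
    tokens = [PUNCT_RE.sub('', t).lower() for t in tokens]
    # Drop empty strings left by leading or trailing separators.
    return [t for t in tokens if t]

sample = 'FOUND AMONG THE PAPERS OF THE LATE DIEDRICH KNICKERBOCKER.'
print(clean_tokens(sample))
```

Wrapping the steps in one function makes it easier to rerun the same cleaning on another document or to swap out a single step later.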


This was a simple example, using clean text and a few basic cleaning operations. This is obviously an ideal format for working with text data, but it is not always this clean. Sometimes you may have PDFs that store text as images, or social media posts filled with typos and grammar errors. You may even find domain-specific vocabulary that is not in any dictionary, or documents with lots of numeric data that really should not be treated as words. You should always strive for simplicity first, and escalate the complexity of the cleaning only as the data demands it.

## Using NLTK

In [13]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [12]:
from nltk import sent_tokenize

sentences = sent_tokenize(text)
print(sentences[0])

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/thomasnield/nltk_data'
    - '/Users/thomasnield/git/anaconda_data_preparation_llm/.venv/nltk_data'
    - '/Users/thomasnield/git/anaconda_data_preparation_llm/.venv/share/nltk_data'
    - '/Users/thomasnield/git/anaconda_data_preparation_llm/.venv/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

The error tells us what is missing: the `punkt` tokenizer resource has to be downloaded before `sent_tokenize` will work.


In [None]:
import nltk
nltk.download('punkt')