# Cleaning Data for LLMs

It is unreasonable to take raw text from a variety of sources and expect it to be ready for large language models. A series of steps gets the data ready, from cleaning to vectorizing it. We will focus on cleaning the text data first, covering NLTK and spaCy.

## The Legend of Sleepy Hollow

For this example, we are going to use the American short story *The Legend of Sleepy Hollow* by Washington Irving. A plain text version [can be found easily at Project Gutenberg](https://www.gutenberg.org/ebooks/41), but we have included it with this notebook for convenience. Let's load the file contents as a string into the `text` variable.

In [None]:
filename = 'legend_of_sleepy_hollow.txt'

# the context manager closes the file even if reading fails
with open(filename, encoding="utf-8") as file:
    text = file.read()

Let's then display the contents. 

In [None]:
# display the text 
text

Here we can make some observations about our data. 

* Thankfully this is pretty clean text and we do not have to clean up any HTML, PDF markup, or other boilerplate here.
* There is some boilerplate for licensing and other metadata which we may want to remove.
* This book is in English and was not translated from another language.
* We do not anticipate spelling or grammar mistakes.
* There are some interesting hyphenations and historical spellings like "red-tipt" and "yellow-tipt."
* We also have frequent newline `\n` characters, which are artificially inserted to wrap lines at roughly 70 characters.
* There do not seem to be numbers, or at least not enough of them that we have to handle them.
* There are names in this document, like Yost Van Houten.

There is a lot more going on here but this is simple enough to get us started. 
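
The hard-wrapped newlines noted above can be undone with a small regular expression. The following is a minimal sketch, assuming that a single newline is a line wrap and a blank line marks a paragraph break; the helper name `unwrap_lines` and the sample string are hypothetical and are not part of the notebook.

```python
import re

def unwrap_lines(raw: str) -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalize Windows line endings so the pattern below only sees "\n".
    raw = raw.replace("\r\n", "\n")
    # A newline that is neither preceded nor followed by another newline
    # is treated as a wrap and replaced with a space.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", raw)

# Hypothetical sample: one wrapped line followed by a paragraph break.
sample = "a drowsy, dreamy influence\nseems to hang over the land\n\nSecond paragraph"
print(unwrap_lines(sample))
```

Whether to unwrap at all depends on the goal: tokenizers that split on whitespace do not care, but sentence-level processing often reads better on unwrapped text.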

If we open up the text file directly in a text editor we will see there is license boilerplate before line 27 and after line 1159. Rather than rely on line numbers, it might be easier to match the marker lines that end the first boilerplate section and start the second. Let's remove the boilerplate at the beginning and end of the document using some regular expression patterns.

In [None]:
import re 

text = re.sub(r"^(.|\n)+START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW \*{3}", '', text)
text = re.sub(r"\*{3} END OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW (.|\n)+", '', text)
text = text.strip()

text
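
As an alternative to the regular expressions above, the same trimming can be sketched with plain string searches, which avoids backtracking over the whole document. This is only a sketch: the function name `strip_gutenberg_boilerplate` is hypothetical, and the marker strings are assumed to match the ones used in the patterns above.

```python
def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the Project Gutenberg start and end markers."""
    start_marker = "START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    start = raw.find(start_marker)
    end = raw.find(end_marker)
    if start == -1 or end == -1:
        # Markers missing: leave the text untouched apart from trimming.
        return raw.strip()

    # The body begins on the line after the start marker line.
    body_start = raw.find("\n", start)
    if body_start == -1 or body_start > end:
        return raw.strip()
    return raw[body_start:end].strip()

# Hypothetical miniature document for illustration.
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***\n"
    "\nFOUND AMONG THE PAPERS\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***\n"
    "More license text"
)
print(strip_gutenberg_boilerplate(sample))
```

Because it does not hard-code the title, this version could be reused on other Project Gutenberg files, provided they use the same marker wording.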

## Manual Tokenization

Understandably, if we want to meaningfully prepare this data we will need to split up the words. We will learn how to do this from scratch in Python to understand the process a little bit before we bring in libraries to help us. 

The simplest approach is to split the text on whitespace using Python's built-in `split()` method.

In [None]:
text.split()

We can again use [regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) to split on more elaborate patterns than whitespace. Here we split on runs of non-word characters (`\W+`), so punctuation is dropped and hyphenated words such as "red-tipt" are split into separate tokens.

In [None]:
import re 

words = re.split(r'\W+', text)

words
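
If we would rather keep hyphenated words and contractions intact, one option is to describe what a token looks like instead of what separates tokens. The sketch below is an illustration only: the pattern and the sample sentence are assumptions, and it handles ASCII letters and straight apostrophes but not typographic quotes or accented characters.

```python
import re

# Hypothetical sample containing hyphenated words and a contraction.
sample = "a red-tipt arrow, and yellow-tipt wings; don't stop"

# Splitting on non-word characters breaks hyphenated words apart.
split_tokens = re.split(r"\W+", sample)

# Matching letters optionally joined by "-" or "'" keeps them together.
token_pattern = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")
kept_tokens = token_pattern.findall(sample)

print(split_tokens)
print(kept_tokens)
```

Which behavior is preferable depends on the downstream task; splitting shrinks the vocabulary, while keeping the hyphen preserves historical spellings as single tokens.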

Now let's say we want to remove punctuation. We can get a convenient set of punctuation characters from Python's standard library. 

In [None]:
import re 
import string 

print(string.punctuation)

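We can then build a regular expression character set from these punctuation characters and use it to strip them out of each word. Because the `\W+` split above already discarded most punctuation, this mainly catches leftovers such as underscores, which regular expressions treat as word characters.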

In [None]:
regex_punct = re.compile(f'[{re.escape(string.punctuation)}]')
stripped = [regex_punct.sub('', w) for w in words]
stripped

We should probably also make the casing consistent, picking either uppercase or lowercase and sticking to that one convention.

In [None]:
lowercased = [w.lower() for w in stripped]
lowercased

This was a simple example, using clean text with some simple cleaning operations. This is obviously an ideal format to work with, but text data is not always this clean. Sometimes you may have PDFs that store text as images, or social media posts filled with typos and grammar errors. You may even find domain-specific vocabulary you will not find in a dictionary, or documents with lots of numeric data that really should not be treated as words. You should always strive for simplicity first, and escalate the complexity of the cleaning only as the data demands it.

## Using NLTK

The Natural Language Toolkit (NLTK) is a Python library for processing and working with text. We can use it to clean text and get it ready for machine learning applications.

You will need to install NLTK using pip. 

```
pip install -U nltk
```

You will also need to download all the data for the library. 

```
python -m nltk.downloader all
```


### Breaking Up Words

We can split up words in NLTK using the `word_tokenize()` function. It will split on white space and punctuation including commas, periods, and contractions like `what's -> what 's`. 

In [None]:
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
words

You will see here that the tokens above include punctuation marks as separate tokens. We can filter those out if we like using Python's built-in `isalpha()` string method.

In [None]:
no_puncts = [w for w in words if w.isalpha()]
no_puncts
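
One caveat is that `isalpha()` is strict: it rejects any token containing a non-letter character, so hyphenated words and contraction fragments are dropped along with the punctuation. The sketch below illustrates this on a hand-written token list (an assumption, not actual tokenizer output) and shows a looser filter that keeps any token containing at least one letter.

```python
# Hand-written tokens for illustration only.
tokens = ["red-tipt", "wings", ",", "what", "'s", "1820", "."]

# Strict filter: every character must be a letter.
alpha_only = [t for t in tokens if t.isalpha()]

# Looser filter: keep tokens that contain at least one letter.
has_letter = [t for t in tokens if any(ch.isalpha() for ch in t)]

print(alpha_only)
print(has_letter)
```

The strict filter is fine when the goal is a bag of plain words; the looser one may suit texts where hyphenated spellings carry meaning.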

### Breaking Up Sentences

Another way we can process this text is to break it up into sentences rather than words. We can bring in the `sent_tokenize()` function from NLTK to achieve this. We can then grab the sentence at index 25 (the 26th sentence, since indexing starts at zero).

In [None]:
from nltk import sent_tokenize

sentences = sent_tokenize(text)
print(sentences[25])
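
To see why a dedicated sentence tokenizer is worth having, it helps to try the obvious do-it-yourself approach. The sketch below splits on whitespace that follows `.`, `!`, or `?`; the sample text is hypothetical, and the point is that this naive rule mistakes the abbreviation "Mr." for the end of a sentence.

```python
import re

# Hypothetical sample with an abbreviation that ends in a period.
sample = "Mr. Crane rode home. He saw a ghost! Was it real?"

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

for sentence in naive_sentences:
    print(repr(sentence))
```

The first "sentence" produced here is just `'Mr.'`, which is the kind of mistake a sentence tokenizer built for the purpose is designed to avoid.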

### Stop Words

Another task you might consider is removing **stop words**, which are words that carry little meaning on their own, like *the* and *is*. You can look at the stop words available for English in NLTK.

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

Since we packaged these stop words into a set, membership checks are fast and we can remove them from our text. Note that because the stop words are in lowercase, we should compare each word in lowercase as well.

In [None]:
no_stop_words = [w for w in no_puncts if w.lower() not in stop_words]
no_stop_words
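
A quick way to see what stop word removal buys us is to count word frequencies before and after filtering. This is a minimal sketch using a tiny hand-picked stop word set and a hand-written token list, both of which are assumptions for illustration rather than NLTK's actual list or the story's tokens.

```python
from collections import Counter

# Tiny illustrative stop word set (NOT NLTK's list).
toy_stop_words = {"the", "of", "was", "a", "and"}

# Hand-written tokens for illustration.
tokens = ["The", "valley", "of", "the", "Hudson", "was", "the",
          "home", "of", "the", "schoolmaster"]

# Frequencies before filtering are dominated by stop words.
before = Counter(t.lower() for t in tokens)

# Compare in lowercase, but keep the original casing of what survives.
kept = [t for t in tokens if t.lower() not in toy_stop_words]

print(before.most_common(2))
print(kept)
```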

### Stemming 

There might be times you want to reduce each word to its root or base. The words *fighter* and *fighting* stem from *fight*. This can help reduce the vocabulary and find broader tones or sentiments in the document. One of the most popular stemming algorithms is the Porter stemming algorithm, which NLTK has available. Keep in mind that a stemmer applies mechanical suffix rules, so its output is not always a dictionary word.

In [None]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

stemmed = [porter.stem(word) for word in no_stop_words]
stemmed
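
To build intuition for what a stemmer does, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies a much more careful, multi-step set of rules; the suffix list, the `toy_stem` name, and the minimum stem length are all assumptions made for this sketch.

```python
# Suffixes are checked in this order; the list is illustrative only.
SUFFIXES = ("ing", "ed", "er", "s")

def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip one common suffix if enough of the word remains."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Only strip when the remaining stem is at least min_stem letters.
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w

for word in ["fighting", "fighter", "fights", "king", "riding"]:
    print(f"{word} -> {toy_stem(word)}")
```

Note how "riding" becomes "rid": crude rules over-stem easily, which is one reason to rely on an established stemmer or on lemmatization instead.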

There are also **lemmatization** tools in NLTK, which help group and consolidate terms. For example, "better" has the word "good" as its lemma, and "was" has "be." We will talk more about lemmatization with spaCy. 

## Using spaCy

While NLTK is a great library, another that has grown popular for its scalability and efficiency is [spaCy](https://spacy.io/). We'll cover a few of its features here.  

First install spaCy as well as its English model. 

```
pip install spacy
python -m spacy download en_core_web_sm
```

After that, you should be set to run spaCy.

In [None]:
import spacy 
nlp = spacy.load("en_core_web_sm")
nlp

Let's load up Sleepy Hollow but this time into a spaCy doc. 

In [None]:
sleepy_hollow = nlp(text)
type(sleepy_hollow)

We can traverse the text tokens. 

In [None]:
[token.text for token in sleepy_hollow]

We can also traverse the sentences, which are packaged into `Span` objects. 

In [None]:
[sent.text for sent in sleepy_hollow.sents]

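There are a lot of helpful attributes on each token in spaCy. Below we iterate over a handful of tokens from the Sleepy Hollow document and print a few attributes that correspond to the concepts we covered previously. Note that `token.idx` is the character offset of the token within the document, not its position in the token sequence.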

In [None]:
for token in sleepy_hollow[50:60]: 
    print(f"Character offset: {token.idx}")
    print(f"Text: {token.text}")
    print(f"Is Alpha: {token.is_alpha}")
    print(f"Is Punctuation: {token.is_punct}")
    print(f"Is Stop Word: {token.is_stop}\n\n")
    

You can also implement your own tokenization procedures but we will keep the scope focused for now. Let's take a look at the lemmatization of each token. Sure enough, spaCy will find the lemma of each word. 

In [None]:
for token in sleepy_hollow: 
    if token.is_alpha:
        print(f"{token.text} -> {token.lemma_}")
    

This should give us enough tools and exposure to text cleaning. Just be aware that how you clean your text data is driven by what you want to achieve and by the state of the data itself. We had a nice clean short story to work with here, with an ideal UTF-8 text body and no markup from HTML or PDF. There will be times you have to handle domain-specific words and language, or decide whether to remove numbers, dates, and mathematical symbols that may not be useful for your language model. Then there are simple but tedious matters like typos and errors, all of which might need to be handled for your large language model.

Consider saving and documenting your cleaning steps too! Make reusable pipelines for your projects and perhaps even save the cleaned documents. 
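
As one way to make the manual steps from earlier reusable, they can be wrapped in a single function. This is a minimal sketch using only the standard library; the function name `clean_tokens`, its `stop_words` parameter, and the sample sentence are hypothetical choices for illustration.

```python
import re
import string

# Character set matching any single punctuation character.
_PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")

def clean_tokens(raw: str, stop_words=frozenset()) -> list:
    """Split on non-word characters, strip punctuation, lowercase, filter."""
    tokens = re.split(r"\W+", raw)
    tokens = [_PUNCT.sub("", t).lower() for t in tokens]
    # Drop empty strings left by the split and any requested stop words.
    return [t for t in tokens if t and t not in stop_words]

# Hypothetical sample sentence.
sample = "The Headless Horseman rides, at night!"
print(clean_tokens(sample))
print(clean_tokens(sample, stop_words={"the", "at"}))
```

Keeping the steps in one documented function makes it easier to apply the same cleaning to every document in a project and to revisit the choices later.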

## Exercise

Take this excerpt from an Edgar Allan Poe poem and tokenize it with the tool of your choice.

In [None]:
text = "Once upon a midnight dreary, While I pondered, weak and weary"

# tokenize the text below

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

### Using NLTK

In [None]:
from nltk.tokenize import word_tokenize
poem = word_tokenize(text)

for w in poem: 
    print(w)

### Using spaCy

In [None]:
import spacy 
nlp = spacy.load("en_core_web_sm")
poem = nlp(text)

for w in poem: 
    print(w)