# **In-Class Assignment: NLP Pipeline**
## *IS 5150*
## Name: Carly

In this in-class assignment we're going to run through the entire NLP pipeline and apply some common cleaning and text-normalizing steps. We'll start with a text that needs extensive processing, so we can work through the full battery of steps, then we'll do the same on a much simpler text that requires less effort.

What steps you need to do will depend on the text and the task at hand!

### Basic Outline of Steps:
1. Import text
2. Remove HTML (if applicable)
3. Case conversion
4. Contractions
5. Removing Stopwords
6. Stemming/Lemmatization
7. Tokenize text
8. Text Output

It's important to note that this list is NOT exhaustive, does NOT need to be done in this order, and which steps you choose WILL depend on the task at hand. The point of this exercise is to show you one procedure for cleaning/processing a text and show two options of output. This will vary based on a given text and what you want to do with it after!
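Because the steps are optional and can be reordered, one way to picture the pipeline is as a list of functions applied in sequence. The sketch below is only an illustration of that idea; `run_pipeline`, `strip_digits`, and `to_lower` are hypothetical names and not part of the notebook's own code.

```python
import re

def strip_digits(text):
    """Remove runs of digits."""
    return re.sub(r'\d+', '', text)

def to_lower(text):
    """Lowercase the whole string."""
    return text.lower()

def run_pipeline(text, steps):
    """Apply each cleaning step in order; reorder or drop steps to suit the task."""
    for step in steps:
        text = step(text)
    return text

sample = "Chapter 12: The WITCH Arrives"
print(run_pipeline(sample, [strip_digits, to_lower]))
```

Swapping, adding, or removing entries in the `steps` list changes the pipeline without touching the individual functions.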

Here, we're going to be using lots of familiar libraries and packages, but we'll also introduce some new ones including the popular and useful `spacy` library! We'll also need `nltk`, `re`, `pprint`, `BeautifulSoup`, `contractions`, `pandas`, and `numpy`.

In [1]:
import nltk, re, pprint

from urllib import request
from bs4 import BeautifulSoup                       # needed for parsing HTML

%pip install contractions
import contractions                                 # contractions dictionary
from string import punctuation

import spacy                                        # used for lemmatization
!python -m spacy download en_core_web_sm            # OR run `python -m spacy download en_core_web_sm` in a terminal

from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
tokenizer = ToktokTokenizer()                       # tokenizer used during stopword removal
from nltk import word_tokenize

import pandas as pd
import numpy as np                                  # general packages for data manipulation


#### **1) Import Text - UTF-8 Encoded**

For this example we'll run the `Helps and Hints for Hallowe'en` text through the NLP pipeline. Why this text? Well, it's pretty messy and provides a good opportunity to demonstrate different processing functions, plus I love Halloween.

In [2]:
url = "https://www.gutenberg.org/cache/epub/68984/pg68984-images.html"
response = request.urlopen(url)

raw = response.read().decode('utf-8-sig')
raw

'<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n<meta charset="utf-8"><style>\r\n#pg-header div, #pg-footer div {\r\n    all: initial;\r\n    display: block;\r\n    margin-top: 1em;\r\n    margin-bottom: 1em;\r\n    margin-left: 2em;\r\n}\r\n#pg-footer div.agate {\r\n    font-size: 90%;\r\n    margin-top: 0;\r\n    margin-bottom: 0;\r\n    text-align: center;\r\n}\r\n#pg-footer li {\r\n    all: initial;\r\n    display: block;\r\n    margin-top: 1em;\r\n    margin-bottom: 1em;\r\n    text-indent: -0.6em;\r\n}\r\n#pg-footer div.secthead {\r\n    font-size: 110%;\r\n    font-weight: bold;\r\n}\r\n#pg-footer #project-gutenberg-license {\r\n    font-size: 110%;\r\n    margin-top: 0;\r\n    margin-bottom: 0;\r\n    text-align: center;\r\n}\r\n#pg-header-heading {\r\n    all: inherit;\r\n    text-align: center;\r\n    font-size: 120%;\r\n    font-weight:bold;\r\n}\r\n#pg-footer-heading {\r\n    all: inherit;\r\n    text-align: center;\r\n    font-size: 120%;\r\n    font-weight: normal;\r\n 

**It's clear that we want to remove the HTML tags, and we can use `BeautifulSoup` with the `html.parser` parser to do that. But that's not going to get rid of all unwanted characters. Let's remove the HTML and then figure out what else needs to be removed...**

#### **2) Remove HTML Tags + Unwanted Characters & Trim Text**

Let's start by defining a function to remove unwanted HTML tags, and then we'll build it out based on other characters we want to remove:

In [29]:
def text_cleaner(text):
    soup = BeautifulSoup(text, 'html.parser')
    for tag in soup(['iframe', 'script']):                     # drop embedded frames and scripts entirely
        tag.extract()
    stripped_text = soup.get_text()
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)    # collapse runs of line breaks into one newline
    stripped_text = re.sub(r'\d+', '', stripped_text)          # remove digits
    stripped_text = re.sub('’', '', stripped_text)             # remove curly apostrophes
    # iteratively add cleaning steps here
    return stripped_text

clean_text = text_cleaner(raw)
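To see what each substitution does, it can help to run patterns like those in `text_cleaner` on a short string first. This is a small sketch with a made-up `sample`; the variable names `step1` to `step3` are only for illustration.

```python
import re

sample = "Jack Frost’s Surprise\r\n\r\nPage 42\nDon’t be late"

# Collapse any run of carriage returns / newlines into a single newline
step1 = re.sub(r'[\r\n]+', '\n', sample)
# Drop digits (page numbers, dates)
step2 = re.sub(r'\d+', '', step1)
# Drop curly apostrophes; note this turns "Don’t" into "Dont"
step3 = re.sub('’', '', step2)

print(repr(step3))
```

Notice that deleting the curly apostrophe also changes contractions and possessives, so later steps that look for an apostrophe may not recognize them. Whether that matters depends on the task.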

In [30]:
clean_text[0:5000]

"\n      The Project Gutenberg eBook of Helps and Hints for Halloween, by Laura Rountree Smith.\n    \nThe Project Gutenberg eBook of Helps and hints for Hallowe'en\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\nTitle: Helps and hints for Hallowe'en\nAuthor: Laura Rountree Smith\nRelease date: September ,  [eBook #]\nLanguage: English\nOriginal publication: United States: March Brothers, \nCredits: Charlene Taylor and the Online Distributed Proofreading Team at https://www.pgdp.net (This file was produced from images generously made available by The Internet Archive/American Libraries.)\n*

**Now let's find the beginning and end of the text and trim it:**

In [31]:
print("[", clean_text.find("START OF THE PROJECT "), ":", clean_text.rfind("END OF THE PROJECT"), "]")

[ 985 : 69895 ]


In [32]:
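start = clean_text.find("START OF THE PROJECT ")
end = clean_text.rfind("END OF THE PROJECT")
clean_text = clean_text[start:end]  # trim the text using the indices found above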
clean_text

"START OF THE PROJECT GUTENBERG EBOOK HELPS AND HINTS FOR HALLOWE'EN ***\n[]\nHelps and Hints\nfor\nHalloween\nBy\nLaura Rountree Smith\nMARCH BROTHERS, Publishers\n, ,  Wright Ave., Lebanon, Ohio\n[]\nCOPYRIGHT, , By\nMARCH BROTHERS\n[]\nContents\nPAGE\nIntroduction\n\nParty Suggestions:\nNut-Crack Night\n\nHalloween Stunts:\nA Shadow Play\n\nThe Black Cat Stunt\n\nA Pumpkin Climbing Game\n\nExercises:\nHalloween Acrostic\n\nTake Care, Tables are Turned!\n\nDrills:\nClown Drill and Song\n\nAutumn Leaf Drill\n\nCat-Tail Drill\n\nMuff Drill\n\nDialogs and Plays:\nThe Halloween Ghosts\n\nOn Halloween Night\n\nJack Frosts Surprise\n\nAn Historical Halloween\n\nThe Witchs Dream\n\nA Halloween Carnival and Wax-Work Show\n\nThe Play of Pomona\n\nHalloween Puppet Play\n\n[]\nNOTE\nSEND FOR OUR COMPLETE\nCATALOG IN WHICH WILL BE\nFOUND ALL THE ACCESSORIES\nNEEDED IN CARRYING OUT THE\nIDEAS GIVEN IN THIS BOOK.\nMarch Brothers, Publishers\n, ,  Wright Ave., Lebanon, Ohio\n[]\nIntroduction\nHist!

In [None]:
print(clean_text[0:1000])
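The `find`/`rfind` trimming above can also be wrapped in a small helper. The sketch below assumes a hypothetical function name, `trim_between`, and adds one safeguard: both `str.find` and `str.rfind` return `-1` when the marker is missing, which would otherwise slice the wrong span without any error.

```python
def trim_between(text, start_marker, end_marker):
    """Return the text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    # find/rfind return -1 when nothing matches, so fail loudly instead of mis-slicing
    if start == -1 or end == -1:
        raise ValueError("marker not found")
    return text[start:end]

sample = "header *** START OF BOOK *** body text *** END OF BOOK *** footer"
print(trim_between(sample, "START OF BOOK", "END OF BOOK"))
```

Looking up the markers at run time also avoids hard-coding character offsets, which change whenever an earlier cleaning step changes.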

#### **3) Lowercase**

**Next in the pipeline is setting all characters to lowercase. Why do we care about doing this?**

Letter case is not relevant to most analyses. If we keep it, words that differ only in capitalization are treated as distinct. We do not want the same word to appear twice in the feature space just because one instance starts a sentence.

In [None]:
def lowercase(text):
  sents_lower = text.lower()
  return sents_lower

lower_text = lowercase(clean_text) # apply to clean_text
print(lower_text[0:1000])
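A quick way to see the effect of lowercasing on the feature space is to count unique tokens before and after. This is a minimal sketch on a made-up sentence, using only `collections.Counter` from the standard library.

```python
from collections import Counter

tokens = "The witch saw the cat and the Cat saw the witch".split()

before = Counter(tokens)                            # case-sensitive counts
after = Counter(token.lower() for token in tokens)  # counts after lowercasing

print(len(before), "unique tokens before;", len(after), "after")
print(after["the"])
```

Here "The"/"the" and "Cat"/"cat" each collapse into a single entry once the text is lowercased.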

#### **4) Contractions**

Contractions are kind of an interesting thing to deal with; we often treat them as one entity, but for NLP purposes we often want to separate them into their constituent words. The `contractions` library contains a list of predefined contractions and their expansions. We will use it here inside an `expand_contractions` function that we define.

In [None]:
contractions.contractions_dict

In [38]:
text_1 = "I didn't even know it's a big deal."

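# Expand contractions word by word using the contractions library
def expand_contractions(text):
    expanded_words = []
    for word in text.split():                          # split on whitespace
        expanded_words.append(contractions.fix(word))  # e.g. "didn't" -> "did not"
    expanded_text = ' '.join(expanded_words)           # rejoin once, after the loop
    return expanded_text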

expand_contractions(text_1)

'I did not even know it is a big deal.'
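To make the idea of a dictionary of contractions concrete, here is a rough sketch of the same approach with a tiny hand-written mapping and the standard `re` module. `CONTRACTION_MAP` and `expand_with_map` are hypothetical names for illustration; this is not how the `contractions` library is implemented, and the real library covers far more forms.

```python
import re

# Hypothetical mini-dictionary; the real library covers far more forms
CONTRACTION_MAP = {"didn't": "did not", "it's": "it is", "can't": "cannot"}

# One alternation pattern, longest keys first so longer forms win
_keys = sorted(map(re.escape, CONTRACTION_MAP), key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(_keys) + r")\b", flags=re.IGNORECASE)

def expand_with_map(text):
    """Replace each known contraction with its expansion (lookup is lowercased)."""
    return _pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_with_map("I didn't even know it's a big deal."))
```

A fixed mapping has to pick one expansion per contraction, so ambiguous forms such as "it's" ("it is" or "it has") are always expanded the same way.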

In [None]:
expanded_text = expand_contractions(lower_text)  # apply to lower_text
print(expanded_text)

#### **5) Removing Stopwords**

Next, we'll define a function to filter out stop words based on a stopwords list from `nltk`. This process involves first tokenizing the text, removing extra whitespace, removing tokens in the stopword list, and then finally rejoining all the remaining words back into a continuous string of text.

**Removal of stopwords isn't required, but it is common. Why do you think this is the case?**

I can see some contexts where keeping the stopwords would be helpful, such as training a chatbot or analyzing parts of speech in a specific type of writing. In most cases, though, removing stopwords is common because it drastically reduces the complexity and noise of the dataset. If we are trying to derive topics, extract specific information, or perform sentiment analysis, keeping only the relevant words can improve the accuracy of each of these tasks.

### **Let's add some comments to see what we're doing here...**

In [43]:
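nltk.download('stopwords')
tokenizer = ToktokTokenizer()
stopword_list = set(stopwords.words('english'))  # a set gives fast membership checks

def remove_stopwords(text):
    # tokenize, then strip whitespace from and lowercase each token
    tokens = [token.strip().lower() for token in tokenizer.tokenize(text)]
    # keep only the tokens that are not in the stopword list
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    # rejoin the remaining tokens into one string
    return ' '.join(filtered_tokens)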

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
stopword_list  # view list of stopwords

In [46]:
stopped_text = remove_stopwords(expanded_text)  # apply to expanded_text

In [None]:
stopped_text
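The tokenize, filter, rejoin pattern does not depend on `nltk` itself. As a rough sketch, here is the same idea with a plain whitespace split and a deliberately tiny, made-up stopword set; `SMALL_STOPWORDS` and `drop_stopwords` are hypothetical names.

```python
# Hypothetical, deliberately tiny stopword set; nltk's English list is much longer
SMALL_STOPWORDS = {"the", "a", "of", "and", "is", "to", "in"}

def drop_stopwords(text, stopword_set=SMALL_STOPWORDS):
    """Whitespace-tokenize, lowercase, and keep only non-stopword tokens."""
    tokens = [token.lower() for token in text.split()]
    return " ".join(token for token in tokens if token not in stopword_set)

print(drop_stopwords("The origin of Halloween is a festival of the Druids"))
```

Ten tokens go in and four come out, which gives a feel for how much of ordinary text is made up of stopwords.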

#### **6) Lemmatization**

Lemmatization is another processing step that isn't required, but is often implemented. Remember that lemmatization is different from stemming in that it attempts to reduce words to their roots (or lemmas), whereas stemming simply strips affixes (mostly suffixes).

Here we will implement a pretrained lemmatizer from `spaCy`.

**Why might we be interested in applying lemmatization?**

For a reason similar to why we lowercase all of the letters: we do not want near-duplicate words with slight differences just because of context. In a feature space, we want simplicity. We want [run] in the feature space, not [run, runs, running, ran]. Simplifying the feature space can improve model accuracy and reduces the computational demand of the model.

In [None]:
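nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe('lemmatizer')  # bring in spaCy lemmatizer (nlp() runs it as part of the pipeline)

def lemmatize_text(text):
    doc = nlp(text)
    # spaCy v2 used '-PRON-' as the lemma of pronouns; keep the original word in that case
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])
    return text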

lemmas = lemmatize_text(stopped_text)  # apply to stopped_text
lemmas
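The difference between stemming and lemmatization is easier to see side by side. The sketch below uses two toy functions written just for this comparison; `crude_stem`, `toy_lemma`, and the `TOY_LEMMAS` table are hypothetical and far simpler than a real stemmer or the spaCy lemmatizer.

```python
def crude_stem(word):
    """Toy stemmer: strip the first matching suffix, with no dictionary lookup."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemmatizer's vocabulary
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run", "witches": "witch"}

def toy_lemma(word):
    """Toy lemmatizer: map a word to its dictionary form when it is known."""
    return TOY_LEMMAS.get(word, word)

for word in ["runs", "running", "ran", "witches"]:
    print(word, "->", crude_stem(word), "|", toy_lemma(word))
```

The suffix stripper turns "running" into "runn" and leaves "ran" untouched, while the lookup maps all three forms to "run". That is the practical reason lemmatization usually gives a cleaner feature space, at the cost of needing vocabulary knowledge.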

#### **7) Sentence Tokenize Text**

Though we've applied word tokenization at other steps in the NLP pipeline and then rejoined our text, we are now ready to tokenize the text into sentences, so that we can put it into a structured format like a dataframe or list.

We will use the `PunktSentenceTokenizer` from `nltk` to perform this step:

In [52]:
punkt_st = nltk.tokenize.PunktSentenceTokenizer()

sents = punkt_st.tokenize(lemmas) # apply to lemmas
sents[3:15] # view some sentences

['march brother , publisher , , wright ave .',
 ', lebanon , ohio [ ] introduction hist !',
 'still !',
 'halloween , fairy troop across green !',
 'halloween elve witch abroad , find custom world build bonfire , keep evil spirit ; night night entertain friend stunt similar perform two hundred year ago .',
 'night fortunes tell , game play , happen birthday fall night , may even able hold converse fairy — go ancient superstition !',
 ', careful halloween , whenever come ; , careful halloween , witch !',
 'halloween origin old druid festival .',
 'druid keep fire burn year honor sun - god .',
 'last night october , meet altar fire burn , put much pomp ceremony , relighte they .',
 'take ember new fire , return home kindle fire hearth .',
 'superstition , home one these [ ] fire burn constantly , throughout year , protect evil .']
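For comparison, a very naive sentence splitter can be written with one regular expression. This is only a sketch to show the basic idea of splitting on end-of-sentence punctuation; `naive_sent_split` is a hypothetical helper, not the Punkt algorithm.

```python
import re

def naive_sent_split(text):
    """Split after ., !, or ? when followed by whitespace; drop empty pieces."""
    pieces = re.split(r'(?<=[.!?])\s+', text)
    return [piece.strip() for piece in pieces if piece.strip()]

sample = "hist ! still ! halloween origin old druid festival . druid keep fire burn year ."
print(naive_sent_split(sample))
```

A rule this simple breaks at every period, including the ones that end abbreviations. The "wright ave ." split in the output above shows that abbreviations remain a common trouble spot for sentence tokenizers.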

#### **8) Deciding clean text output**

Finally, we need to decide how to structure our cleaned text. This is going to depend on what we want to do with it next (which we'll cover in Topic 4). For now, let's store our sentence tokens in a dataframe, and then we'll store our vocab in a list.

**Output is a dataframe of sentences:**

In [54]:
df = pd.DataFrame(sents, columns = ['sentence'])
df

| | sentence |
|---|---|
| 0 | start project gutenberg ebook help hint hallow... |
| 1 | , lebanon , ohio [ ] copyright , , march broth... |
| 2 | drill : clown drill song autumn leaf drill cat... |
| 3 | march brother , publisher , , wright ave . |
| 4 | , lebanon , ohio [ ] introduction hist ! |
| ... | ... |
| 1004 | call , appear . |
| 1005 | [ direction make puppet manipulation find " pu... |
| 1006 | cent . |
| 1007 | order publisher book . ] |


**Output is a list of unique words:**

In [55]:
words = nltk.wordpunct_tokenize(stopped_text)
text = nltk.Text(words)

In [None]:
vocab = sorted(set(text))
vocab

In [57]:
len(vocab)

1852
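A sorted set gives the vocabulary but throws away how often each word occurs. If frequencies are useful for the next task, `collections.Counter` keeps both. This is a small sketch on a made-up token list; `sample_tokens`, `sample_vocab`, and `counts` are illustrative names.

```python
from collections import Counter

sample_tokens = "night fire burn night witch fire night".split()

sample_vocab = sorted(set(sample_tokens))  # unique words, alphabetical
counts = Counter(sample_tokens)            # word -> frequency

print(sample_vocab)
print(counts.most_common(2))
```
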

## **Basic NLP Pipeline**

We can also take a more basic approach and throw everything into one function, which can be helpful for less complicated texts.

In [58]:
url = "https://gutenberg.org/files/68667/68667-h/68667-h.htm"

html = request.urlopen(url).read()

In [None]:
raw = BeautifulSoup(html).get_text()
print(raw)

In [61]:
print("[", raw.find("A LOVERS’ PROLOGUE"), ":", raw.rfind("CHAPTER III"), "]")

[ 1435 : 363426 ]


In [None]:
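start = raw.find("A LOVERS’ PROLOGUE")
end = raw.rfind("CHAPTER III")
raw = raw[start:end]  # trim using the indices found above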
print(raw)

In [65]:
nltk.download('punkt')
def basic_text_cleaner(text):
    # Remove characters that are not letters, digits, whitespace, or periods
    text = re.sub(r'[^A-Za-z0-9\s\.]', '', text)
    # Tokenize, then lowercase and remove stopwords
    tokens = word_tokenize(text)
    tokens = [token.lower() for token in tokens if token.lower() not in stopword_list]

    # Join tokens and trim extra whitespace
    cleaned_text = ' '.join(tokens).strip()

    return cleaned_text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [68]:
cleaned_text = basic_text_cleaner(raw)
cleaned_text
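A character filter like the one in `basic_text_cleaner` is easy to get subtly wrong, so it is worth checking on a short string. The sketch below compares two character classes on a made-up `sample`. In ASCII, the range `A-z` also covers `[`, `\`, `]`, `^`, `_`, and the backtick, which sit between `Z` and `a`.

```python
import re

sample = "Price: $5 [sic] for_each `item` ... ok?"

# Keep ASCII letters, digits, whitespace, and periods; drop everything else
strict = re.sub(r'[^A-Za-z0-9\s.]', '', sample)
# The range A-z also spans [ \ ] ^ _ ` in ASCII, so those characters survive
loose = re.sub(r'[^A-z0-9\s.]', '', sample)

print(strict)
print(loose)
```

Writing the two letter ranges separately, `A-Za-z`, keeps the filter to letters only.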

