# **ICE-3: Text Preprocessing Beyond Tokenization**

This notebook focuses on preprocessing English text.

In [1]:
import re

# for using NLTK
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords

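# for using SpaCy
# (if the import fails, install it first: !pip install spacy
#  and fetch the English pipeline: !python -m spacy download en_core_web_sm)
import spacy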

# for HuggingFace
# !pip install transformers
# !pip install ftfy

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


ModuleNotFoundError: No module named 'spacy'

In [None]:
# trick to wrap text to the viewing window for this notebook
# Ref: https://stackoverflow.com/questions/58890109/line-wrapping-in-collaboratory-google-results
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## **(Tutorial) Tokenizing text using SpaCy**

The following dummy text is used to demonstrate tokenization in SpaCy.

In [None]:
dummy_text1 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text1)

In [None]:
# loads a trained English pipeline with specific preprocessing components
nlp = spacy.load('en_core_web_sm')

# using SpaCy's tokenizer...
doc = nlp(dummy_text1)      # applies the processing pipeline on the text
for token in doc:
  print(token.text)

### **Task 1. Revisiting Tokenization**

Whitespace-based tokenization is a naive approach to tokenizing text: words are extracted by splitting the text wherever whitespace characters (spaces, tabs, newlines) occur, so each token is whatever sits between two runs of whitespace.
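
To make the idea concrete, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption for illustration and is not part of the exercise text). It relies only on Python's built-in `str.split()`, which splits on any run of whitespace when called without arguments.

```python
# Illustrative sketch on a toy sentence (hypothetical input, not the exercise text).
sample = "Dr. Smith's  flight\tlands at 9:30 a.m.\nin New York!"

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines) and never returns empty strings.
tokens = sample.split()

print(tokens)
print(f"Number of tokens: {len(tokens)}")
```

Notice that punctuation stays attached to the neighbouring word; this is one of the limitations that the library tokenizers in the next question are designed to handle.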


#### **Question 1a. Implement the naive approach of tokenizing words (whitespace-based) for the text given in the code block below.**

**Important Note:** 
1. DO NOT use any of the existing implementations for tokenization distributed as part of open-source NLP libraries.
2. **If your solution uses readily available implementations of tokenizers, you will receive zero credit for this question.**
3. Avoid putting additional effort into an advanced implementation of the custom tokenizer. All your implementation needs to do is split words on whitespace characters; that is all that is expected for this question.
4. You can ignore punctuation while building your tokenizer.

In [None]:
inau_text="""The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789. After taking his oath of office on the balcony of Federal Hall in New York City, Washington proceeded to the Senate chamber where he read a speech before members of Congress and other dignitaries. His second Inauguration took place in Philadelphia on March 4, 1793, in the Senate chamber of Congress Hall. There, Washington gave the shortest Inaugural address on record—just 135 words —before repeating the oath of office.
Every President since Washington has delivered an Inaugural address. While many of the early Presidents read their addresses before taking the oath, current custom dictates that the Chief Justice of the Supreme Court administer the oath first, followed by the President’s speech.
William Henry Harrison delivered the longest Inaugural address, at 8,445 words, on March 4, 1841—a bitterly cold, wet day. He died one month later of pneumonia, believed to have been brought on by prolonged exposure to the elements on his Inauguration Day. John Adams’ Inaugural address, which totaled 2,308 words, contained the longest sentence, at 737 words. After Washington’s second Inaugural address, the next shortest was Franklin D. Roosevelt’s fourth address on January 20, 1945, at just 559 words. Roosevelt had chosen to have a simple Inauguration at the White House in light of the nation’s involvement in World War II.
In 1921, Warren G. Harding became the first President to take his oath and deliver his Inaugural address through loud speakers. In 1925, Calvin Coolidge’s Inaugural address was the first to be broadcast nationally by radio. And in 1949, Harry S. Truman became the first President to deliver his Inaugural address over television airwaves.
Most Presidents use their Inaugural address to present their vision of America and to set forth their goals for the nation. Some of the most eloquent and powerful speeches are still quoted today. In 1865, in the waning days of the Civil War, Abraham Lincoln stated, “With malice toward none, with charity for all, with firmness in the right as God gives us to see the right, let us strive on to finish the work we are in, to bind up the nation’s wounds, to care for him who shall have borne the battle and for his widow and his orphan, to do all which may achieve and cherish a just and lasting peace among ourselves and with all nations.” In 1933, Franklin D. Roosevelt avowed, “we have nothing to fear but fear itself.” And in 1961, John F. Kennedy declared, “And so my fellow Americans: ask not what your country can do for you—ask what you can do for your country.”
Today, Presidents deliver their Inaugural address on the West Front of the Capitol, but this has not always been the case. Until Andrew Jackson’s first Inauguration in 1829, most Presidents spoke in either the House or Senate chambers. Jackson became the first President to take his oath of office and deliver his address on the East Front Portico of the U.S. Capitol in 1829. With few exceptions, the next 37 Inaugurations took place there, until 1981, when Ronald Reagan’s Swearing-In Ceremony and Inaugural address occurred on the West Front Terrace of the Capitol. The West Front has been used ever since."""

# add your code below this comment and execute it once you have written the code



#### **Question 1b. For the same text as in Q1a, apply the tokenizers listed below. Analyze how the words are being tokenized by each of the tokenizers. Compare and contrast the outputs of the two tokenization schemes.**
1. **NLTK's tokenizer**
2. **SpaCy's tokenizer**

**Note:** You are already familiar with NLTK's tokenization, which was demonstrated in the previous labs. If you do not remember it, revisit those labs to refresh your memory.

In [None]:
inau_text="""The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789. After taking his oath of office on the balcony of Federal Hall in New York City, Washington proceeded to the Senate chamber where he read a speech before members of Congress and other dignitaries. His second Inauguration took place in Philadelphia on March 4, 1793, in the Senate chamber of Congress Hall. There, Washington gave the shortest Inaugural address on record—just 135 words —before repeating the oath of office.
Every President since Washington has delivered an Inaugural address. While many of the early Presidents read their addresses before taking the oath, current custom dictates that the Chief Justice of the Supreme Court administer the oath first, followed by the President’s speech.
William Henry Harrison delivered the longest Inaugural address, at 8,445 words, on March 4, 1841—a bitterly cold, wet day. He died one month later of pneumonia, believed to have been brought on by prolonged exposure to the elements on his Inauguration Day. John Adams’ Inaugural address, which totaled 2,308 words, contained the longest sentence, at 737 words. After Washington’s second Inaugural address, the next shortest was Franklin D. Roosevelt’s fourth address on January 20, 1945, at just 559 words. Roosevelt had chosen to have a simple Inauguration at the White House in light of the nation’s involvement in World War II.
In 1921, Warren G. Harding became the first President to take his oath and deliver his Inaugural address through loud speakers. In 1925, Calvin Coolidge’s Inaugural address was the first to be broadcast nationally by radio. And in 1949, Harry S. Truman became the first President to deliver his Inaugural address over television airwaves.
Most Presidents use their Inaugural address to present their vision of America and to set forth their goals for the nation. Some of the most eloquent and powerful speeches are still quoted today. In 1865, in the waning days of the Civil War, Abraham Lincoln stated, “With malice toward none, with charity for all, with firmness in the right as God gives us to see the right, let us strive on to finish the work we are in, to bind up the nation’s wounds, to care for him who shall have borne the battle and for his widow and his orphan, to do all which may achieve and cherish a just and lasting peace among ourselves and with all nations.” In 1933, Franklin D. Roosevelt avowed, “we have nothing to fear but fear itself.” And in 1961, John F. Kennedy declared, “And so my fellow Americans: ask not what your country can do for you—ask what you can do for your country.”
Today, Presidents deliver their Inaugural address on the West Front of the Capitol, but this has not always been the case. Until Andrew Jackson’s first Inauguration in 1829, most Presidents spoke in either the House or Senate chambers. Jackson became the first President to take his oath of office and deliver his address on the East Front Portico of the U.S. Capitol in 1829. With few exceptions, the next 37 Inaugurations took place there, until 1981, when Ronald Reagan’s Swearing-In Ceremony and Inaugural address occurred on the West Front Terrace of the Capitol. The West Front has been used ever since."""

# add your code below this comment and execute it once you have written the code.
# you can add additional code cells if need be. Make sure to use the text cell provided to answer the question.



**Answer for Q1b.** Type in your answer here!


---


## **(Tutorial) Stemming and Lemmatization using NLTK**

Let's see how we can perform stemming and lemmatization using the NLTK library.
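
Before running the NLTK cells, the toy sketch below may help contrast the two ideas: a stemmer applies suffix-stripping rules, whereas a lemmatizer looks words up in a vocabulary. Everything here is hypothetical and deliberately tiny: `TOY_SUFFIXES`, `TOY_LEMMAS`, `toy_stem` and `toy_lemmatize` are made-up names, and the logic is neither the Porter algorithm nor WordNet.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
# Hypothetical suffix rules, tried in this order.
TOY_SUFFIXES = ["ing", "ed", "es", "s"]

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {
    "cats": "cat",
    "mice": "mouse",
    "better": "good",
    "running": "run",
    "studies": "study",
}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the mini-dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["cats", "studies", "running", "mice", "better"]:
    print(f"{w:10s} stem={toy_stem(w):10s} lemma={toy_lemmatize(w)}")
```

The rule-based function can produce strings that are not real words, while the lookup can only handle words it already knows. Keep this trade-off in mind when comparing the outputs of the NLTK stemmer and lemmatizer.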

In [None]:
# importing PorterStemmer class from nltk.stem module
from nltk.stem import PorterStemmer
porter = PorterStemmer()    # instantiating an object of the PorterStemmer class

stem = porter.stem('cats')    # calling the stemmer algorithm on the desired word
print(f"'cats' after stemming: {stem}")

stem = porter.stem('better')
print(f"'better' after stemming: {stem}")

stem = porter.stem('abaci')
print(f"'abaci' after stemming: {stem}")

stem = porter.stem('aardwolves')
print(f"'aardwolves' after stemming: {stem}")

stem = porter.stem('generically')
print(f"'generically' after stemming: {stem}")

In [None]:
# importing WordNet-based lemmatizer class from nltk.stem module
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()    # instantiating an object of the WordNetLemmatizer class

lemma = lemmatizer.lemmatize('cats')    # calling the lemmatization algorithm on the desired word
print(f"'cats' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('better')
print(f"'better' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('abaci')
print(f"'abaci' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('aardwolves')
print(f"'aardwolves' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('generically')
print(f"'generically' after lemmatization: {lemma}")

print("\n\n\n")
lemma = lemmatizer.lemmatize('better', pos='a')   # 'a' denotes the ADJECTIVE part-of-speech
print(f"'better' (as an adjective) after lemmatization: {lemma}")

### **Task 2: Lemmatization or Stemming?**




Following is the text that you will be using for this task (Task 2 only):

In [None]:
# This is the text on which you have to perform stemming; taken from Wikipedia.
text = "In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root."
print("Given text:")
print(text)

Performing some preprocessing that we have learnt in previous ICEs...

In [None]:
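en_stopwords = set(stopwords.words('english'))

def remove_punc(text_string):
  # lowercase the text and drop every character that is not a letter, digit, or space
  return re.sub(r'[^a-zA-Z0-9 ]', '', text_string.lower())

def remove_stopwords(text_string):
  # split() with no argument splits on runs of whitespace, so no empty tokens are produced
  return [ token for token in text_string.split() if token not in en_stopwords ]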

# applying punctuation removal to the text
unpunc_text = remove_punc(text)
print("After punctuation removal:")
print(unpunc_text)

# applying stopword removal to the text
clean_text = remove_stopwords(unpunc_text)
print("\n\nAfter stopword removal:")
print(clean_text)

#### **Question 2. Perform stemming on the cleaned text above using the Porter Stemmer from NLTK.**

In [None]:
# apply Porter Stemmer on the cleaned text (after punctuation and stopwords are removed) below this comment



#### **Question 3. Perform lemmatization on the same cleaned text above using NLTK's lemmatizer.**

In [None]:
# apply NLTK's lemmatizer on the cleaned text (after punctuation and stopwords are removed) below this comment



#### **Question 4. What do you think is better - Lemmatization or Stemming?**

**IMPORTANT NOTE: Substantiate your answer not just with your observations from solving Questions 2 and 3 but also with your understanding of language in general. Additionally, consider whether there are cases where one performs better than the other.**

**Answer for Q4.:** Type your answer here!

---


## **(Tutorial) Sentence Segmentation using SpaCy**

The following dummy text is used to demonstrate how SpaCy segments text into sentences.

In [None]:
dummy_text3 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text3)

In [None]:
nlp = spacy.load('en_core_web_sm')

# performing sentence splitting...
doc = nlp(dummy_text3)
for sentence in doc.sents:
  print(sentence)


### **Task 3. Segmenting Sentences**

For this task, we will be using the [*Inaugural Address*](https://www.inaugural.senate.gov/inaugural-address/) text excerpt that we used in ICE-2.

In [None]:
inau_text="""The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789. After taking his oath of office on the balcony of Federal Hall in New York City, Washington proceeded to the Senate chamber where he read a speech before members of Congress and other dignitaries. His second Inauguration took place in Philadelphia on March 4, 1793, in the Senate chamber of Congress Hall. There, Washington gave the shortest Inaugural address on record—just 135 words —before repeating the oath of office.
Every President since Washington has delivered an Inaugural address. While many of the early Presidents read their addresses before taking the oath, current custom dictates that the Chief Justice of the Supreme Court administer the oath first, followed by the President’s speech.
William Henry Harrison delivered the longest Inaugural address, at 8,445 words, on March 4, 1841—a bitterly cold, wet day. He died one month later of pneumonia, believed to have been brought on by prolonged exposure to the elements on his Inauguration Day. John Adams’ Inaugural address, which totaled 2,308 words, contained the longest sentence, at 737 words. After Washington’s second Inaugural address, the next shortest was Franklin D. Roosevelt’s fourth address on January 20, 1945, at just 559 words. Roosevelt had chosen to have a simple Inauguration at the White House in light of the nation’s involvement in World War II.
In 1921, Warren G. Harding became the first President to take his oath and deliver his Inaugural address through loud speakers. In 1925, Calvin Coolidge’s Inaugural address was the first to be broadcast nationally by radio. And in 1949, Harry S. Truman became the first President to deliver his Inaugural address over television airwaves.
Most Presidents use their Inaugural address to present their vision of America and to set forth their goals for the nation. Some of the most eloquent and powerful speeches are still quoted today. In 1865, in the waning days of the Civil War, Abraham Lincoln stated, “With malice toward none, with charity for all, with firmness in the right as God gives us to see the right, let us strive on to finish the work we are in, to bind up the nation’s wounds, to care for him who shall have borne the battle and for his widow and his orphan, to do all which may achieve and cherish a just and lasting peace among ourselves and with all nations.” In 1933, Franklin D. Roosevelt avowed, “we have nothing to fear but fear itself.” And in 1961, John F. Kennedy declared, “And so my fellow Americans: ask not what your country can do for you—ask what you can do for your country.”
Today, Presidents deliver their Inaugural address on the West Front of the Capitol, but this has not always been the case. Until Andrew Jackson’s first Inauguration in 1829, most Presidents spoke in either the House or Senate chambers. Jackson became the first President to take his oath of office and deliver his address on the East Front Portico of the U.S. Capitol in 1829. With few exceptions, the next 37 Inaugurations took place there, until 1981, when Ronald Reagan’s Swearing-In Ceremony and Inaugural address occurred on the West Front Terrace of the Capitol. The West Front has been used ever since."""

print(inau_text)

#### **Question 5a. Write a custom Python script that segments the text above into sentences by treating the period (.) character as the sentence boundary. Analyze the generated output and provide your observations.**

**Note:** You do not need to remove stopwords or punctuation, or apply any other preprocessing technique. Do only what is asked, to minimize the effort needed to answer this question.

**Hint**: Use print() to inspect how the sentences are being split while you analyze the output and note down your observations.

In [None]:
# write your code below this comment



#### **Question 5b. Using SpaCy, perform sentence segmentation on the same text (that was used in Q5a.). Analyze the generated output and provide your observations.**

**Hint**: Use print() to inspect how the sentences are being split while you analyze the output and note down your observations.

In [None]:
# write your code below this comment





---



## **(Tutorial) Subword Tokenization using HuggingFace**
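
As background for this tutorial, the sketch below illustrates the core idea behind byte-pair encoding: a word starts as a sequence of characters, and an ordered list of merge rules joins adjacent symbols into larger subword units. This is a simplified toy under stated assumptions: `TOY_MERGES` is a made-up merge list (not taken from `gpt2-merges.txt`), `toy_bpe` is a hypothetical helper, and the real tokenizer additionally works on bytes and handles leading spaces.

```python
# Toy sketch of applying byte-pair merges to a single word.
# The merge list is hypothetical; earlier rules have higher priority.
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("e", "s"), ("es", "t")]

def toy_bpe(word, merges):
    """Apply each merge rule, in priority order, to the symbol sequence."""
    symbols = list(word)  # start from individual characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # join the pair (left, right) wherever it occurs side by side
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

for w in ["lower", "lowest", "slow"]:
    print(w, "->", toy_bpe(w, TOY_MERGES))
```

With this toy merge list, words that share a frequent character sequence end up sharing a subword unit, while the remaining characters become separate pieces.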

In [None]:
!pip install tokenizers

!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

In [None]:
from tokenizers import ByteLevelBPETokenizer
gpt2vocab = "gpt2-medium-vocab.json"
gpt2merges = "gpt2-merges.txt"

bpe = ByteLevelBPETokenizer(gpt2vocab, gpt2merges)
bpe_encoding = bpe.encode("The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789.")
print(bpe_encoding.ids)
print(bpe_encoding.tokens)

### **Task 4: Understanding Subword Tokenization**

Consider the following two sentences:

* I like yellow roses better than red ones.
* Looks like John is bettering the working conditions at his organization.


**Question 6. Encode these sentences using the Byte-Pair Encoding tokenizer (created during the tutorial). Retrieve the tokens from the encodings of the two sentences. Are there any interesting observations when you compare the tokens of the two encodings? What do you think causes what you observe?**

In [None]:
# use the bpe tokenizer that was created during the tutorial to encode the sentences
# write your code below this comment and execute
# type in your answer to the question asked above in the following cell (see below)



**Answer for Q6.:** Type your answer in here!

---

## **References**
* https://spacy.io/usage/spacy-101
* https://spacy.io/models/en
* https://www.nltk.org/howto/wordnet.html
* https://www.nltk.org/_modules/nltk/stem/wordnet.html