# **Lab 1.1 Text Preprocessing**

In this lab session, you will re-explore some important tasks in text preprocessing: tokenization, normalization, stopword removal, and stemming/lemmatization.

## **Tokenizer**
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

Tokenizers divide strings into lists of substrings. For example, a sentence tokenizer returns the list of sentences in a string, and a word tokenizer returns the list of words.


**Sentence Tokenizer**

- breaks a paragraph of text into sentences





**Word tokenizer**

- breaks a paragraph of text into words
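
To make the two ideas concrete before turning to NLTK, here is a minimal pure-Python sketch. The regular expressions and the function names `naive_sent_tokenize` and `naive_word_tokenize` are illustrative assumptions, not what NLTK does internally; NLTK's tokenizers handle abbreviations, contractions and other edge cases that this sketch ignores.

```python
import re

def naive_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Hypothetical helper: it will wrongly split after abbreviations such as "Dr.".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Hello World. Is this sentence two? Yes!"
print(naive_sent_tokenize(sample))
print(naive_word_tokenize("Hello World."))
```

Comparing these results with the NLTK output in the cells that follow shows where a rule this simple is good enough and where it is not.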


**How to tokenize sentences in NLTK**

In [1]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
from nltk.tokenize import sent_tokenize

In [3]:
text = 'this’s a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it’s your turn.'


In [4]:
sent_tokenize_list = sent_tokenize(text)  # split the text into sentences

In [5]:
sent_tokenize_list

['this’s a sent tokenize test.',
 'this is sent two.',
 'is this sent three?',
 'sent 4 is cool!',
 'Now it’s your turn.']

In [6]:
len(sent_tokenize_list)  # number of sentences

5

**Tokenizing text into words**


To do this, import **word_tokenize** from the **nltk.tokenize** module:


In [7]:
from nltk.tokenize import word_tokenize

word_tokenize("Hello World.")


['Hello', 'World', '.']

In [8]:
word_tokenize("They aren't the best for O'Neill's team!") #tokenizing each word and symbols in the sentence

['They', 'are', "n't", 'the', 'best', 'for', "O'Neill", "'s", 'team', '!']

Try again with other sentences and observe the output.


# **Stopword Removal**
Stopwords are the most common words in a natural language, such as *a*, *the* and *is*. When analyzing text data and building NLP models, these words might not add much to the meaning of a document, so they are often removed.

Let’s see which English stopwords are available in NLTK:


In [9]:

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
print(stop_words)


{'it', 'after', 'during', 're', 'shan', 'and', 'above', 'in', 'couldn', 'which', 'no', 'myself', 'nor', 'didn', 'his', "wouldn't", 'wasn', 'weren', 'few', 'down', 'any', "mustn't", 'on', 'own', 'mustn', 'she', 'only', 'itself', 'its', "it's", 'than', 'will', 'themselves', 'what', 'most', 'now', "shouldn't", 'each', 'between', 'that', 'further', 'needn', 'as', 've', 'when', 'through', "aren't", 'too', 'where', 'yours', 'were', "wasn't", 'been', 'do', 'wouldn', "you're", 'same', 'don', 'those', 'below', 'if', 'can', 'ma', 'other', 'the', 'for', 'herself', 'under', 'while', 'off', 'm', 'shouldn', 'hers', 'then', 'more', 'y', 'there', 'being', 'theirs', "shan't", 'them', "isn't", 'he', 'himself', 'was', 'aren', "haven't", 'of', 'about', "mightn't", 'both', "needn't", 'have', "that'll", 'had', "couldn't", 'her', 'yourself', 'ain', 'who', 'at', 'yourselves', "didn't", 'why', 'i', 'over', 'with', 'but', 'me', 'did', 's', 'is', 'once', 'up', "doesn't", 'has', 'isn', "you'll", 'whom', 'to', 't'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Let’s see how stop word removal can be done using the corpus module. Observe what words are being removed here. 


In [10]:
#import nltk
#from nltk.corpus import stopwords
#from nltk.tokenize import word_tokenize 
#set(stopwords.words('english'))


# credits to Analytics Vidhya
#sample sentence
text = """NLTK supports stop word removal, and you can find the list of stop words in the corpus module. 
To remove stop words from a sentence, you can divide your text into words and then remove the word if it exits 
in the list of stop words provided by NLTK."""

# set of stop words
stop_words = set(stopwords.words('english')) 
# tokens of words  
word_tokens = word_tokenize(text) 
    
filtered_sentence = [] 
  
for w in word_tokens: 
    if w not in stop_words: 
        filtered_sentence.append(w) 



print("\n\nOriginal Sentence \n\n")
print(" ".join(word_tokens)) 

print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sentence))



Original Sentence 


NLTK supports stop word removal , and you can find the list of stop words in the corpus module . To remove stop words from a sentence , you can divide your text into words and then remove the word if it exits in the list of stop words provided by NLTK .


Filtered Sentence 


NLTK supports stop word removal , find list stop words corpus module . To remove stop words sentence , divide text words remove word exits list stop words provided NLTK .
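
In the filtered output above, "To" survives even though "to" is a stop word: the NLTK English list is lowercase and the `in` test is case-sensitive. One way to handle this is to compare the lowercased token instead. The sketch below is a minimal illustration; the function name `remove_stopwords` and the tiny `demo_stop_words` set are assumptions made for the example, and in the lab you would pass `set(stopwords.words('english'))`.

```python
def remove_stopwords(tokens, stop_words):
    # Compare in lowercase so that "To" and "to" are treated alike,
    # but keep the original casing of the tokens that survive.
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Tiny illustrative set; NOT the NLTK list.
demo_stop_words = {"to", "the", "a", "from"}
tokens = ["To", "remove", "stop", "words", "from", "a", "sentence"]

print(remove_stopwords(tokens, demo_stop_words))
```

Keeping the original casing of the surviving tokens leaves the decision about lowercasing to the normalization step.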


# **Normalization**


For certain tasks where capitalization is not important (for example, text classification), one of the simplest forms of normalization is converting the text to lowercase. Other forms include stemming and lemmatization, which reduce a word to its canonical form.

Let's look at how text can be converted to lowercase by defining a function:

In [11]:
def changetolower(text):
    lowerText = text.lower()
    print("Before:",text)
    print("After:",lowerText)

In [12]:
changetolower("Bandar Baru Bangi is a town in Selangor")

Before: Bandar Baru Bangi is a town in Selangor
After: bandar baru bangi is a town in selangor
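
Because `changetolower` only prints, its result cannot be passed on to the next preprocessing step. A small sketch of a variant that returns the normalized text is given here; the name `normalize_case` is hypothetical. It uses `str.casefold()`, a standard Python method that behaves like `lower()` for plain English text but is stricter for some non-English characters.

```python
def normalize_case(text):
    # casefold() is a stricter lower(): identical for ASCII text,
    # but it also maps characters such as the German "ß" to "ss".
    return text.casefold()

normalized = normalize_case("Bandar Baru Bangi is a town in Selangor")
print(normalized)

# The two methods differ only on special characters:
print("Straße".lower(), "Straße".casefold())
```

For English-only text either method works; the choice matters mainly when the corpus mixes languages.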


# **Stemming**

Stemming is the process of reducing inflected words (e.g. troubled, troubles) to their root form (e.g. trouble). There are different algorithms for stemming. The most common one, which is also known to be empirically effective for English, is Porter's algorithm. NLTK provides a variety of stemmers, so let's try two of them and observe the output of each.

In [13]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [14]:
def mystemmer(myword):
    stemmer1 = PorterStemmer()
    print("Porter Stemmer's output:", stemmer1.stem(myword))
    stemmer2 = LancasterStemmer()
    print("Lancaster Stemmer's output:", stemmer2.stem(myword))

Observe the output and try with other words.

In [15]:
mystemmer('ponies')

Porter Stemmer's output: poni
Lancaster Stemmer's output: pony
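
To compare the two stemmers on several words at once, a helper that returns the stems instead of printing them can be handy. The sketch below is one possible way to do this; the function name `compare_stemmers` and the word list are illustrative assumptions. Neither stemmer needs an NLTK data download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

def compare_stemmers(words):
    # Build each stemmer once and reuse it for every word.
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    # Map each word to a (Porter stem, Lancaster stem) pair.
    return {w: (porter.stem(w), lancaster.stem(w)) for w in words}

results = compare_stemmers(["ponies", "troubled", "troubles", "quickly"])
for word, (porter_stem, lancaster_stem) in results.items():
    print(f"{word:10} Porter: {porter_stem:10} Lancaster: {lancaster_stem}")
```

Stems such as `poni` are not dictionary words; a stemmer only aims to map related word forms to the same string.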


# **Lemmatizer**

Lemmatization is very similar to stemming: the goal is to remove inflections and map a word to its root form. The difference is that lemmatization doesn’t just chop things off; it maps each word to an actual dictionary form (its lemma). For example, the word “better” would map to “good”. A lemmatizer may use a dictionary such as WordNet for its mappings, or special rule-based approaches. Here is an example of lemmatization in action using a WordNet-based approach in NLTK:

In [16]:
def mylemma(myword): # function for lemmatization
    lemmatizer = WordNetLemmatizer() #this lemmatizer uses WordNet as the dictionary/lexical resource
    print(lemmatizer.lemmatize(myword)) 
    

In [17]:
mylemma('flies')

fly


Try the lemmatizer with other words such as 'computerization' or 'flies'

# **LAB TASK 1**

Given the text below:

During the heat of the space race in the 1960's, NASA quickly discovered that ballpoint pens would not work in the zero gravity confines of its space capsules. After considerable research and development, the Astronaut Pen was developed at a cost of $1 million. The pen worked in zero gravity, upside down, underwater, on almost any surface including glass and also enjoyed some modest success as a novelty item back here on earth. The Soviet Union, when faced with the same problem, used a pencil. This has to be the funniest joke ever.


Write a program that performs the task as follows:

1.   Tokenize the text
2.   Remove the stopwords
3.   Change the text into lowercase
4.   Perform stemming/lemmatization






Submit your lab task in UKMFolio by 28 October 2022 (Friday). Make sure to include your name and matric number.  

# **NAME: CHONG WEI YI**

# **MATRIC NO: A180497**

***1.1 Tokenization***

In [18]:
import nltk
from nltk.tokenize import word_tokenize

text = "During the heat of the space race in the 1960's, NASA quickly discovered that ballpoint pens would not work in the zero gravity confines of its space capsules. After considerable research and development, the Astronaut Pen was developed at a cost of $1 million. The pen worked in zero gravity, upside down, underwater, on almost any surface including glass and also enjoyed some modest success as a novelty item back here on earth. The Soviet Union, when faced with the same problem, used a pencil. This has to be the funniest joke ever."
word_tokenize(text)

['During',
 'the',
 'heat',
 'of',
 'the',
 'space',
 'race',
 'in',
 'the',
 '1960',
 "'s",
 ',',
 'NASA',
 'quickly',
 'discovered',
 'that',
 'ballpoint',
 'pens',
 'would',
 'not',
 'work',
 'in',
 'the',
 'zero',
 'gravity',
 'confines',
 'of',
 'its',
 'space',
 'capsules',
 '.',
 'After',
 'considerable',
 'research',
 'and',
 'development',
 ',',
 'the',
 'Astronaut',
 'Pen',
 'was',
 'developed',
 'at',
 'a',
 'cost',
 'of',
 '$',
 '1',
 'million',
 '.',
 'The',
 'pen',
 'worked',
 'in',
 'zero',
 'gravity',
 ',',
 'upside',
 'down',
 ',',
 'underwater',
 ',',
 'on',
 'almost',
 'any',
 'surface',
 'including',
 'glass',
 'and',
 'also',
 'enjoyed',
 'some',
 'modest',
 'success',
 'as',
 'a',
 'novelty',
 'item',
 'back',
 'here',
 'on',
 'earth',
 '.',
 'The',
 'Soviet',
 'Union',
 ',',
 'when',
 'faced',
 'with',
 'the',
 'same',
 'problem',
 ',',
 'used',
 'a',
 'pencil',
 '.',
 'This',
 'has',
 'to',
 'be',
 'the',
 'funniest',
 'joke',
 'ever',
 '.']

***1.2 Stopwords Removal***

In [19]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "During the heat of the space race in the 1960's, NASA quickly discovered that ballpoint pens would not work in the zero gravity confines of its space capsules. After considerable research and development, the Astronaut Pen was developed at a cost of $1 million. The pen worked in zero gravity, upside down, underwater, on almost any surface including glass and also enjoyed some modest success as a novelty item back here on earth. The Soviet Union, when faced with the same problem, used a pencil. This has to be the funniest joke ever."

# set of stop words
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text) 
    
filtered_sentence = [] 
  
for w in word_tokens: 
    if w not in stop_words: 
        filtered_sentence.append(w) 



print("\n\nOriginal Sentence \n\n")
print(" ".join(word_tokens)) 

print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sentence))



Original Sentence 


During the heat of the space race in the 1960 's , NASA quickly discovered that ballpoint pens would not work in the zero gravity confines of its space capsules . After considerable research and development , the Astronaut Pen was developed at a cost of $ 1 million . The pen worked in zero gravity , upside down , underwater , on almost any surface including glass and also enjoyed some modest success as a novelty item back here on earth . The Soviet Union , when faced with the same problem , used a pencil . This has to be the funniest joke ever .


Filtered Sentence 


During heat space race 1960 's , NASA quickly discovered ballpoint pens would work zero gravity confines space capsules . After considerable research development , Astronaut Pen developed cost $ 1 million . The pen worked zero gravity , upside , underwater , almost surface including glass also enjoyed modest success novelty item back earth . The Soviet Union , faced problem , used pencil . This funn

***1.3 Lowercase***

In [20]:
def changetolower(text):
    lowerText = text.lower()
    print("Before:",text)
    print("After:",lowerText)
  
changetolower(" ".join(filtered_sentence))

Before: During heat space race 1960 's , NASA quickly discovered ballpoint pens would work zero gravity confines space capsules . After considerable research development , Astronaut Pen developed cost $ 1 million . The pen worked zero gravity , upside , underwater , almost surface including glass also enjoyed modest success novelty item back earth . The Soviet Union , faced problem , used pencil . This funniest joke ever .
After: during heat space race 1960 's , nasa quickly discovered ballpoint pens would work zero gravity confines space capsules . after considerable research development , astronaut pen developed cost $ 1 million . the pen worked zero gravity , upside , underwater , almost surface including glass also enjoyed modest success novelty item back earth . the soviet union , faced problem , used pencil . this funniest joke ever .


***1.4 Stemming/Lemmatization***

In [22]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('omw-1.4')

def mystemmer(myword):
    stemmer1 = PorterStemmer()
    print("Porter Stemmer's output:", stemmer1.stem(myword))
    stemmer2 = LancasterStemmer()
    print("Lancaster Stemmer's output:", stemmer2.stem(myword))

def mylemma(myword): # function for lemmatization
    lemmatizer = WordNetLemmatizer() #this lemmatizer uses WordNet as the dictionary/lexical resource
    print(lemmatizer.lemmatize(myword))

text = " ".join(filtered_sentence).lower()
print(text)
word_token = word_tokenize(text)

for word in word_token:
    mystemmer(word)

for word in word_token:
    mylemma(word)


during heat space race 1960 's , nasa quickly discovered ballpoint pens would work zero gravity confines space capsules . after considerable research development , astronaut pen developed cost $ 1 million . the pen worked zero gravity , upside , underwater , almost surface including glass also enjoyed modest success novelty item back earth . the soviet union , faced problem , used pencil . this funniest joke ever .
Porter Stemmer's output: dure
Lancaster Stemmer's output: dur
Porter Stemmer's output: heat
Lancaster Stemmer's output: heat
Porter Stemmer's output: space
Lancaster Stemmer's output: spac
Porter Stemmer's output: race
Lancaster Stemmer's output: rac
Porter Stemmer's output: 1960
Lancaster Stemmer's output: 1960
Porter Stemmer's output: 's
Lancaster Stemmer's output: 's
Porter Stemmer's output: ,
Lancaster Stemmer's output: ,
Porter Stemmer's output: nasa
Lancaster Stemmer's output: nas
Porter Stemmer's output: quickli
Lancaster Stemmer's output: quick
Porter Stemmer's outpu

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
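
The steps can also be chained into one function that returns its result rather than printing it. The sketch below is an illustration under assumptions, not the expected lab answer: the name `preprocess`, its parameters, and the tiny stop word set in the demo are hypothetical, and it lowercases before filtering so that capitalized stop words such as "The" are removed too. It takes already tokenized input, so it runs without any NLTK data download.

```python
from nltk.stem import PorterStemmer

def preprocess(tokens, stop_words, stem):
    # 1. lowercase every token
    lowered = [tok.lower() for tok in tokens]
    # 2. drop stop words (the list is lowercase, so the comparison now matches)
    kept = [tok for tok in lowered if tok not in stop_words]
    # 3. apply the stemming (or lemmatizing) function passed in by the caller
    return [stem(tok) for tok in kept]

# Tokens taken from the word_tokenize output for the last sentence of the text.
tokens = ['This', 'has', 'to', 'be', 'the', 'funniest', 'joke', 'ever', '.']
# Tiny illustrative set; in the lab, use set(stopwords.words('english')).
demo_stop_words = {'this', 'has', 'to', 'be', 'the'}

print(preprocess(tokens, demo_stop_words, PorterStemmer().stem))
```

Passing the stemmer in as a function makes it easy to swap in `LancasterStemmer().stem` or `WordNetLemmatizer().lemmatize` and compare the results.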
