Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# PREPROCESSING

## Tokenization

*Tokenization* is the process of splitting an input text into tokens (words or other relevant elements, such as punctuation).

#### Making use of regular expressions

We can tokenize a piece of text by using a regular expression tokenizer, such as the one available in **NLTK**.

For starters, let's stick to alphanumerical sequences of characters.

In [1]:
import nltk
from nltk import regexp_tokenize

text = 'That U.S.A. poster-print costs $12.40...'

pattern = '[a-zA-Z0-9_]+'
tokens = regexp_tokenize(text, pattern)
print(len(tokens))
print(tokens)

9
['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']
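
As a cross-check, the same tokens can be obtained with the standard library alone. The sketch below assumes that, for a pattern without capturing groups, `regexp_tokenize` behaves like `re.findall`; the variable name `tokens_re` is introduced here only for illustration.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# every maximal run of ASCII letters, digits or underscores becomes one token
tokens_re = re.findall(r'[a-zA-Z0-9_]+', text)

print(len(tokens_re))
print(tokens_re)
```

Anything outside the character class (periods, the hyphen, the dollar sign) acts as a separator and is dropped, which is why `U.S.A.` and `$12.40` fall apart.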


We can refine the regular expression to obtain a more sensible tokenization.

In [2]:
pattern = r'''(?x)           # set flag to allow verbose regexps
        (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
        | \w+(?:-\w+)*       # words with optional internal hyphens
        | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
        | \.\.\.             # ellipsis
        | [][.,;"'?():_`-]   # these are separate tokens; includes ], [
        '''

tokens = regexp_tokenize(text, pattern)
print(len(tokens))
print(tokens)

6
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
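
One detail worth keeping in mind when writing punctuation classes such as the last alternative of the pattern: inside `[...]`, a hyphen placed between two characters denotes a range, so it only stands for a literal hyphen when it comes first, comes last, or is escaped. A minimal sketch with a made-up sample string (the names `sample`, `as_range` and `as_literal` are illustrative):

```python
import re

sample = 'A:b-_'

# ':-_' is read as the range from ':' to '_', which covers 'A' but not '-'
as_range = re.findall(r'[:-_]', sample)

# with the hyphen placed last, it is a literal character
as_literal = re.findall(r'[:_-]', sample)

print(as_range)    # ['A', ':', '_']
print(as_literal)  # [':', '-', '_']
```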


#### Using NLTK

NLTK also includes a word tokenizer, which gets roughly the same result (it finds "words" and punctuation).

In [3]:
from nltk import word_tokenize

text = 'That U.S.A. poster-print costs $12.40...'
tokens = word_tokenize(text)

print(len(tokens))
print(tokens)

7
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']


In [4]:
word_tokenize("I don't think we're flying today.")

['I', 'do', "n't", 'think', 'we', "'re", 'flying', 'today', '.']
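
The way contractions are split can be approximated with two regular-expression substitutions. This is only a sketch of the idea, not NLTK's implementation: the clitic list is an illustrative, incomplete assumption, and the helper `split_clitics` is hypothetical.

```python
import re

# clitics split off in this sketch (an illustrative, incomplete list)
CLITICS = re.compile(r"(n't|'s|'re|'ve|'ll|'d|'m)\b", re.IGNORECASE)

def split_clitics(sentence):
    # put a space before each clitic, e.g. "don't" -> "do n't"
    spaced = CLITICS.sub(r" \1", sentence)
    # detach punctuation that ends a whitespace-delimited chunk
    spaced = re.sub(r"([.,!?])(?=\s|$)", r" \1", spaced)
    return spaced.split()

print(split_clitics("I don't think we're flying today."))
```

Unlike `word_tokenize`, this sketch knows nothing about abbreviations, so the final period of a chunk such as `U.S.A.` would be detached as well.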

You can try [other tokenizers](https://www.nltk.org/api/nltk.tokenize.html) available in NLTK.

In [5]:
# try out the wordpunct tokenizer
from nltk.tokenize import wordpunct_tokenize

wordpunct_tokenize(text)

['That',
 'U',
 '.',
 'S',
 '.',
 'A',
 '.',
 'poster',
 '-',
 'print',
 'costs',
 '$',
 '12',
 '.',
 '40',
 '...']
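
The NLTK documentation describes the wordpunct tokenizer as splitting text with the regular expression `\w+|[^\w\s]+`. Assuming that pattern, the result above can be reproduced with the standard library (the name `tokens_wp` is introduced here for illustration):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# a token is either a run of word characters or a run of characters
# that are neither word characters nor whitespace
tokens_wp = re.findall(r'\w+|[^\w\s]+', text)

print(len(tokens_wp))
print(tokens_wp)
```

Because consecutive punctuation characters form a single run, the trailing `...` stays together while each isolated period becomes its own token.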

Let's get a sentence from the user and tokenize it.

In [6]:
s = input("Enter some text:")
tokens = word_tokenize(s)

print("You typed", len(tokens), "words:", tokens)

You typed 4 words: ['ola', 'sou', 'a', 'neni']


#### Sentence segmentation

We may also be interested in splitting the text into sentences.

In [7]:
from nltk import sent_tokenize

text = "Hello. Are you Mr. Smith? Just to let you know that I have finished my M.Sc. and Ph.D. on AI. I loved it!"
sentences = sent_tokenize(text)

print(sentences)
print("Number of sentences:", len(sent_tokenize(text)))

['Hello.', 'Are you Mr. Smith?', 'Just to let you know that I have finished my M.Sc.', 'and Ph.D. on AI.', 'I loved it!']
Number of sentences: 5
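
To see why sentence segmentation is harder than it looks, here is a naive baseline that treats every `.`, `!` or `?` followed by whitespace as a sentence boundary. It is a sketch for comparison only; the rule and the name `naive_sentences` are assumptions of this example.

```python
import re

text = ("Hello. Are you Mr. Smith? Just to let you know that I have finished "
        "my M.Sc. and Ph.D. on AI. I loved it!")

# naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r'(?<=[.!?])\s+', text)

print(naive_sentences)
print("Number of sentences:", len(naive_sentences))
```

The naive rule produces 7 pieces, because it also breaks after `Mr.`, `M.Sc.` and `Ph.D.`. `sent_tokenize` handles `Mr.` and `Ph.D.` correctly, but, as the output above shows, it still breaks after `M.Sc.`.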


#### Experimenting with long texts

We can try downloading a book from Project Gutenberg.

In [8]:
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

print(len(raw))
print(raw[:75])

1176812
﻿The Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky


How many sentences are there? Print out the second sentence (index 1).

In [21]:
# insert your code here
sentences = sent_tokenize(raw)

print(sentences[1])

You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org.


How many tokens are there? What is the index of the first token in the second sentence?

In [22]:
# insert your code here
tokens = word_tokenize(raw)

print("number of tokens:", len(tokens))

s = sentences[1]
sentence_tokens = word_tokenize(s)
print("index of the first token in the second sentence:", tokens.index(sentence_tokens[0]))

number of tokens: 257058
index of the first token in the second sentence: 42
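
Note that `list.index` returns the position of the first occurrence of a token, so the lookup above relies on the sentence's first token not appearing earlier in the text. A sketch of an alternative that does not depend on this is to accumulate sentence lengths. The helper `sentence_start_offsets` and the toy data are hypothetical, and the approach assumes that concatenating the per-sentence token lists reproduces the full token list.

```python
from itertools import accumulate

def sentence_start_offsets(tokenized_sentences):
    """Return, for each sentence, the index of its first token in the flattened token list."""
    lengths = [len(sentence) for sentence in tokenized_sentences]
    # running totals starting at 0; the last total is the overall length, so drop it
    return list(accumulate(lengths, initial=0))[:-1]

# toy data: both sentences start with the same token
tokenized = [['You', 'may', 'copy', 'it', '.'], ['You', 'may', 'share', 'it', '.']]
flat = [tok for sentence in tokenized for tok in sentence]

print(flat.index(tokenized[1][0]))           # 0, the first 'You' in the text
print(sentence_start_offsets(tokenized)[1])  # 5, the actual start of the second sentence
```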


#### Dealing with multi-word expressions (MWE)

Sometimes we want certain words to stick together when tokenizing, such as in multi-word names.

In [12]:
word_tokenize("Good muffins cost $3.88\nin New York.")

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']

One way to do it is to supply our own lexicon and make use of NLTK's [MWE tokenizer](https://www.nltk.org/api/nltk.tokenize.mwe.html).

In [13]:
from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize

s = "Good muffins cost $3.88\nin New York."
mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator=' ')

[mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]

[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York', '.']]
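
The underlying idea, re-joining tokens that match an entry of a lexicon, can be sketched in plain Python. This is not NLTK's implementation; the helper `merge_mwes` is hypothetical and uses a simple greedy, longest-match-first scan.

```python
def merge_mwes(tokens, mwes, separator=' '):
    """Greedily merge known multi-word expressions in a token list (longest match first)."""
    # try longer expressions first; ignore empty entries
    candidates = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in candidates:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # no expression starts at this position: keep the token as is
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']
print(merge_mwes(tokens, [('New', 'York'), ('Hong', 'Kong')]))
```

Scanning the whole lexicon at every position is fine for a handful of expressions; a large lexicon would call for a more efficient lookup structure.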

Try out your own multi-word expressions to tokenize text.

In [14]:
# try out your own multi-word expressions
s = "Estamos a estudar Engenharia Informatica na Universidade do Porto"
mwe = MWETokenizer([('Engenharia', 'Informatica'), ('Universidade', 'do', 'Porto')], separator=' ')

[mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]

[['Estamos',
  'a',
  'estudar',
  'Engenharia Informatica',
  'na',
  'Universidade do Porto']]

## Stemming and Lemmatization

*Stemming* and *Lemmatization* are techniques used to normalize tokens, so as to reduce the size of the vocabulary.
Whereas lemmatization maps each word to its lemma (its dictionary form), stemming typically applies a set of transformation rules that cut off word-final affixes.
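
The difference between the two approaches can be illustrated with a toy sketch. Both the suffix list and the lemma dictionary below are made up for this example, and the helpers `toy_stem` and `toy_lemmatize` are hypothetical; real stemmers and lemmatizers are considerably more elaborate.

```python
# illustrative suffix list and lemma dictionary (both made up for this sketch)
SUFFIXES = ('ing', 'ed', 's')
LEMMAS = {'feet': 'foot', 'women': 'woman', 'studying': 'study'}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms."""
    return LEMMAS.get(word, word)

for w in ['studying', 'funded', 'feet', 'women']:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based stemmer handles regular suffixes it has never seen in a dictionary (`funded`), while only the dictionary lookup can resolve irregular forms such as `feet`.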

#### Stemming

NLTK includes one of the most well-known stemmers: the [Porter stemmer](https://www.emerald.com/insight/content/doi/10.1108/00330330610681286/full/pdf?casa_token=eT_IPtH_eLEAAAAA:Z3lAtxWdxf0FL479mL-A7tC-_QRzxNeeyC2DFLyWwGBlcj6DQcwu2Bnq37waDPcXKOnXkMMDtKGyCaYGZtYcb3lgBZ9uaHKUNO0JCMivSdPE4HTe).

In [15]:
from nltk.stem import PorterStemmer

# initialize the Porter Stemmer
porter = PorterStemmer()

Let's use an illustrative piece of text:

In [16]:
sentence = '''The European Commission has funded a numerical study to analyze the purchase of a pipe organ with no noise
for Europe's organization. Numerous donations have followed the analysis after a noisy debate.'''

# tokenize: split the text into words
word_list = nltk.word_tokenize(sentence)

print("\nOriginal word list:", word_list)
print("\nOriginal number of distinct tokens:", len(set(word_list)))


Original word list: ['The', 'European', 'Commission', 'has', 'funded', 'a', 'numerical', 'study', 'to', 'analyze', 'the', 'purchase', 'of', 'a', 'pipe', 'organ', 'with', 'no', 'noise', 'for', 'Europe', "'s", 'organization', '.', 'Numerous', 'donations', 'have', 'followed', 'the', 'analysis', 'after', 'a', 'noisy', 'debate', '.']

Original number of distinct tokens: 31


Now, we stem the tokens in the text:

In [17]:
# stem list of words and join
stemmed_output = ' '.join([porter.stem(w) for w in word_list])
print("Stemmed text:", stemmed_output)

# tokenize: split the text into words
stemmed_word_list = nltk.word_tokenize(stemmed_output)

print("\nStemmed word list:", stemmed_word_list)
print("\nStemmed number of distinct tokens:", len(set(stemmed_word_list)))

Stemmed text: the european commiss ha fund a numer studi to analyz the purchas of a pipe organ with no nois for europ 's organ . numer donat have follow the analysi after a noisi debat .

Stemmed word list: ['the', 'european', 'commiss', 'ha', 'fund', 'a', 'numer', 'studi', 'to', 'analyz', 'the', 'purchas', 'of', 'a', 'pipe', 'organ', 'with', 'no', 'nois', 'for', 'europ', "'s", 'organ', '.', 'numer', 'donat', 'have', 'follow', 'the', 'analysi', 'after', 'a', 'noisi', 'debat', '.']

Stemmed number of distinct tokens: 28


You can see the reduced vocabulary size. Some tokens are over-generalized (semantically different tokens that get the same stem), while others are under-generalized (semantically similar tokens that get different stems).
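
To make this concrete, the sketch below groups a few words from the example by their Porter stem. The (word, stem) pairs are copied from the output above; the names `pairs` and `groups` are introduced only for illustration.

```python
from collections import defaultdict

# (word, stem) pairs copied from the Porter output above
pairs = [('numerical', 'numer'), ('Numerous', 'numer'),
         ('organ', 'organ'), ('organization', 'organ'),
         ('analyze', 'analyz'), ('analysis', 'analysi'),
         ('noise', 'nois'), ('noisy', 'noisi')]

# collect the words that share each stem
groups = defaultdict(list)
for word, stem in pairs:
    groups[stem].append(word)

for stem, words in groups.items():
    print(stem, words)
```

Here `organ` and `organization` collapse into one stem despite their different meanings (over-generalization), whereas `analyze`/`analysis` and `noise`/`noisy` end up with different stems (under-generalization).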

Try out [other stemmers](https://www.nltk.org/api/nltk.stem.html) available in NLTK.

In [18]:
# try out other stemmers
from nltk.stem.lancaster import LancasterStemmer

st = LancasterStemmer()

# stem list of words and join
stemmed_output = ' '.join([st.stem(w) for w in word_list])
print("Stemmed text:", stemmed_output)

# tokenize: split the text into words
stemmed_word_list = nltk.word_tokenize(stemmed_output)

print("\nStemmed word list:", stemmed_word_list)
print("\nStemmed number of distinct tokens:", len(set(stemmed_word_list)))

Stemmed text: the europ commit has fund a num study to analys the purchas of a pip org with no nois for europ 's org . num don hav follow the analys aft a noisy deb .

Stemmed word list: ['the', 'europ', 'commit', 'has', 'fund', 'a', 'num', 'study', 'to', 'analys', 'the', 'purchas', 'of', 'a', 'pip', 'org', 'with', 'no', 'nois', 'for', 'europ', "'s", 'org', '.', 'num', 'don', 'hav', 'follow', 'the', 'analys', 'aft', 'a', 'noisy', 'deb', '.']

Stemmed number of distinct tokens: 26


We can try a few for Portuguese:

In [20]:
nltk.download('rslp')

[nltk_data] Downloading package rslp to
[nltk_data]     C:\Users\ineso\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping stemmers\rslp.zip.


True

In [23]:
# Portuguese stemmer: https://www.nltk.org/_modules/nltk/stem/rslp.html
from nltk.stem import RSLPStemmer

stemmer = RSLPStemmer()
sentence = "Estou mesmo a gostar desta unidade curricular, todos gostamos de unidades curriculares interessantes."

word_list = nltk.word_tokenize(sentence)
stemmed_output = ' '.join([stemmer.stem(w) for w in word_list])
print(stemmed_output)

est mesm a gost dest unidad curricul , tod gost de unidad curricul interess .


In [24]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("portuguese")
sentence = "Estou mesmo a gostar desta unidade curricular, todos gostamos de unidades curriculares interessantes."

word_list = nltk.word_tokenize(sentence)
stemmed_output = ' '.join([stemmer.stem(w) for w in word_list])
print(stemmed_output)

estou mesm a gost dest unidad curricul , tod gost de unidad curricul interess .


#### Lemmatization

NLTK includes a [lemmatizer based on WordNet](https://www.nltk.org/api/nltk.stem.wordnet.html).

In [26]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ineso\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [27]:
# WordNet lemmatizer
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

sentence = "Men and women love to study artificial intelligence while studying data science. My feet and teeth are clean!"

# tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)

# lemmatize list of words
lemmatized_output = [lemmatizer.lemmatize(w) for w in word_list]
print(lemmatized_output)

['Men', 'and', 'women', 'love', 'to', 'study', 'artificial', 'intelligence', 'while', 'studying', 'data', 'science', '.', 'My', 'feet', 'and', 'teeth', 'are', 'clean', '!']
['Men', 'and', 'woman', 'love', 'to', 'study', 'artificial', 'intelligence', 'while', 'studying', 'data', 'science', '.', 'My', 'foot', 'and', 'teeth', 'are', 'clean', '!']


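By default, `lemmatize` treats every word as a noun (`pos='n'`) and its lookup is case-sensitive, which is why 'studying' and 'Men' are left unchanged above. Compare the result with stemming applied to the same text.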

In [29]:
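# compare with stemming
# note: `stemmer` is still the Portuguese Snowball stemmer from the earlier cell;
# use `porter` or SnowballStemmer("english") for a fair comparison on English text
stemmed_output = [stemmer.stem(w) for w in word_list]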
print(stemmed_output)

['men', 'and', 'women', 'lov', 'to', 'study', 'artificial', 'intelligenc', 'whil', 'studying', 'dat', 'scienc', '.', 'my', 'feet', 'and', 'teeth', 'are', 'clean', '!']
