# Text Preprocessing I (week 2)

This lab is based on the articles [Text Preprocessing in Python: Steps, Tools, and Examples](https://www.kdnuggets.com/2018/11/text-preprocessing-python.html)
and [Text Data Preprocessing: A Walkthrough in Python](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html).

We outline the basic steps of text preprocessing, which convert human-language text into a machine-readable format for further processing.

In [1]:
print('Hello World!')

Hello World!


## 1. Text data preprocessing: step-by-step approach

### Convert text to lowercase using lower()

In [2]:
input_str = "The 5 biggest countries by population in 2017 are China, \
India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.
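
For text that is not purely ASCII, lowercasing alone may not make equivalent spellings compare equal. The following is a minimal sketch, assuming Python 3 (where `str.casefold()` is available); the sample words are illustrative and not part of the lab data.

```python
# Compare lower() with casefold() on a word containing the German sharp s.
words = ["Straße", "STRASSE"]

lowered = [w.lower() for w in words]     # lower() keeps the sharp s as it is
folded = [w.casefold() for w in words]   # casefold() maps the sharp s to "ss"

print(lowered)
print(folded)
```

For plain English text such as the sentence above, `lower()` is normally sufficient.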


### Tokenization
Tokenization is the process of splitting a text into smaller pieces called tokens. Words, numbers, and punctuation marks can all be tokens.

In [3]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data]     (_ssl.c:727)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data]     (_ssl.c:727)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data]     (_ssl.c:727)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed (_ssl.c:727)>


False

In [4]:
from nltk.tokenize import word_tokenize # NLTK default tokenizer
from nltk.tokenize import TreebankWordTokenizer 

# word_tokenize builds on a Treebank-style tokenizer, so both give the same tokens for this sentence
s = "NLTK is a leading platform for building Python programs to work with human language data." 
print(TreebankWordTokenizer().tokenize(s))  

input_str = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = word_tokenize(input_str)
print()
print(tokens)  # same output as TreebankWordTokenizer

s = "They'll save and invest more." 
# standard contractions are split: They'll => 'They', "'ll"
print()
print(word_tokenize(s))

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']
()
['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']
()
['They', "'ll", 'save', 'and', 'invest', 'more', '.']


In [5]:
from nltk.tokenize import WordPunctTokenizer, WhitespaceTokenizer 

# Tokenize into runs of alphanumeric characters (words) and runs of punctuation.
print(WordPunctTokenizer().tokenize(s))

# Tokenize on whitespace only.
print(WhitespaceTokenizer().tokenize(s))

['They', "'", 'll', 'save', 'and', 'invest', 'more', '.']
["They'll", 'save', 'and', 'invest', 'more.']
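
The two behaviours above can be approximated with the standard library alone. This is a rough sketch, not NLTK's implementation: the regular expression is our reading of the pattern that `WordPunctTokenizer` is documented to use, and `str.split()` stands in for `WhitespaceTokenizer`.

```python
import re

s = "They'll save and invest more."

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
word_punct = re.findall(r"\w+|[^\w\s]+", s)

# Splitting on whitespace leaves punctuation attached to the words.
whitespace = s.split()

print(word_punct)
print(whitespace)
```

Seeing the rule written out makes it clear why the apostrophe becomes a token of its own in the first case and stays inside `They'll` in the second.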


### Remove stop words
“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. They usually carry little meaning on their own and are often removed from texts before further processing. Stop words can be removed using the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.
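
The idea can be sketched without any library. The stop list below is a tiny hand-picked set for illustration only (NLTK's English list is much longer), and `drop_stop_words` is a hypothetical helper name.

```python
# A tiny hand-picked stop list for illustration; not NLTK's list.
TOY_STOP_WORDS = {"the", "a", "on", "is", "all", "for", "to", "with"}

def drop_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The cat is on the mat".split()
print(drop_stop_words(tokens))
```

Comparing the lowercase form matters because stop lists are usually stored in lowercase, so a sentence-initial `The` would otherwise slip through.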

In [6]:
from nltk.corpus import stopwords

input_str = "NLTK is a leading platform for building \
Python programs to work with human language data."
stop_words = set(stopwords.words('english'))
print("Stop Words:")
print(stop_words)
print()
tokens = word_tokenize(input_str)
# compare in lowercase, because the NLTK stop list is all lowercase
result = [i for i in tokens if i.lower() not in stop_words]
print("Results:")
print(result)

Stop Words:
set([u'all', u'just', u"don't", u'being', u'over', u'both', u'through', u'yourselves', u'its', u'before', u'o', u'don', u'hadn', u'herself', u'll', u'had', u'should', u'to', u'only', u'won', u'under', u'ours', u'has', u"should've", u"haven't", u'do', u'them', u'his', u'very', u"you've", u'they', u'not', u'during', u'now', u'him', u'nor', u"wasn't", u'd', u'did', u'didn', u'this', u'she', u'each', u'further', u"won't", u'where', u"mustn't", u"isn't", u'few', u'because', u"you'd", u'doing', u'some', u'hasn', u"hasn't", u'are', u'our', u'ourselves', u'out', u'what', u'for', u"needn't", u'below', u're', u'does', u"shouldn't", u'above', u'between', u'mustn', u't', u'be', u'we', u'who', u"mightn't", u"doesn't", u'were', u'here', u'shouldn', u'hers', u"aren't", u'by', u'on', u'about', u'couldn', u'of', u"wouldn't", u'against', u's', u'isn', u'or', u'own', u'into', u'yourself', u'down', u"hadn't", u'mightn', u"couldn't", u'wasn', u'your', u"you're", u'from', u'her', u'their', u'are

### Stemming
Stemming is the process of reducing words to their stem, base, or root form (for example, "books" becomes "book" and "looked" becomes "look"). Two widely used algorithms are the Porter stemming algorithm, which removes common morphological and inflectional endings from words, and the Lancaster stemming algorithm, which is more aggressive.
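
To make "removing endings" concrete, here is a deliberately naive suffix stripper. It is not the Porter or Lancaster algorithm, which apply many ordered rules with extra conditions; `naive_stem` and its suffix list are assumptions made for this sketch.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["books", "looked", "looking", "is"]:
    print(w, "->", naive_stem(w))
```

Even this toy version shows the main trade-off: the rules are cheap and need no dictionary, but they can produce stems that are not real words (for instance `news` would become `new`).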

In [7]:
# use PorterStemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
.


In [8]:
# use LancasterStemmer - more aggressive
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

ther
ar
sev
typ
of
stem
algorithm
.


### Lemmatization
The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses **lexical knowledge bases** to get the correct base forms of words.
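
A lexicon lookup can be sketched with a plain dictionary. The table below is a hypothetical miniature lexicon written for this example, not WordNet data. It also mimics one detail of `WordNetLemmatizer.lemmatize`, whose `pos` argument defaults to noun, which is why verb forms are left unchanged unless a verb tag is passed.

```python
# Hypothetical miniature lexicon: (word, part of speech) -> lemma.
TOY_LEXICON = {
    ("mice", "n"): "mouse",
    ("cities", "n"): "city",
    ("been", "v"): "be",
    ("had", "v"): "have",
    ("done", "v"): "do",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; fall back to the word itself."""
    return TOY_LEXICON.get((word, pos), word)

print(toy_lemmatize("mice"))        # found as a noun
print(toy_lemmatize("been"))        # no noun entry, so returned unchanged
print(toy_lemmatize("been", "v"))   # found as a verb
```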

In [9]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    # with no pos argument, each word is looked up as a noun
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse


### Sentence Segmentation

In [10]:
from nltk.tokenize import sent_tokenize

text = "this's a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it's your turn."
sent_tokenize_list = sent_tokenize(text)

print("The length of sentences", " = ", len(sent_tokenize_list))
print(sent_tokenize_list)

('The length of sentences', ' = ', 5)
["this's a sent tokenize test.", 'this is sent two.', 'is this sent three?', 'sent 4 is cool!', "Now it's your turn."]
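
For comparison, a rule-based splitter can be written in a few lines. This is a naive sketch rather than what `sent_tokenize` does (it relies on the pretrained `punkt` model downloaded earlier); `naive_sent_split` is a hypothetical helper name.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "this's a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it's your turn."
print(len(naive_sent_split(text)))

# An abbreviation exposes the weakness of the rule.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

The rule handles the test text, but it also splits after `Dr.`, which is the kind of case a trained sentence tokenizer is meant to handle better.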


In [11]:
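def sentence_preprocess(document):
    """Print the document split into sentences, then word tokens, then POS-tagged tokens."""
    sentences = nltk.sent_tokenize(document)  # split into sentences
    print(sentences, "\n")
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # tokenize each sentence
    print(sentences, "\n")
    sentences = [nltk.pos_tag(sent) for sent in sentences]  # tag each token with its part of speech
    print(sentences, "\n")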

In [12]:
sample_doc = """I cannot say I like the movie. The acting is really lousy, don't you think so?

My favorite movie franchises, in order are: (1) Indiana Jones; (2) Marvel Cinematic Universe; (3) Star Wars; (4) Back to the Future.

This is a great movie by Tom Cruise."""

In [13]:
sentence_preprocess(sample_doc)

(['I cannot say I like the movie.', "The acting is really lousy, don't you think so?", 'My favorite movie franchises, in order are: (1) Indiana Jones; (2) Marvel Cinematic Universe; (3) Star Wars; (4) Back to the Future.', 'This is a great movie by Tom Cruise.'], '\n')
([['I', 'can', 'not', 'say', 'I', 'like', 'the', 'movie', '.'], ['The', 'acting', 'is', 'really', 'lousy', ',', 'do', "n't", 'you', 'think', 'so', '?'], ['My', 'favorite', 'movie', 'franchises', ',', 'in', 'order', 'are', ':', '(', '1', ')', 'Indiana', 'Jones', ';', '(', '2', ')', 'Marvel', 'Cinematic', 'Universe', ';', '(', '3', ')', 'Star', 'Wars', ';', '(', '4', ')', 'Back', 'to', 'the', 'Future', '.'], ['This', 'is', 'a', 'great', 'movie', 'by', 'Tom', 'Cruise', '.']], '\n')
([[('I', 'PRP'), ('can', 'MD'), ('not', 'RB'), ('say', 'VB'), ('I', 'PRP'), ('like', 'IN'), ('the', 'DT'), ('movie', 'NN'), ('.', '.')], [('The', 'DT'), ('acting', 'NN'), ('is', 'VBZ'), ('really', 'RB'), ('lousy', 'JJ'), (',', ','), ('do', 'VBP')

## 2. Additional Text Processing

### n-grams
An n-gram is a run of n successive words.
[TextBlob](https://textblob.readthedocs.io/en/dev/) is a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. Its `TextBlob.ngrams()` method returns a list of all runs of n successive words, each as a `WordList`.
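
The construction itself needs no library: n-grams are a sliding window over the token list. The sketch below assumes whitespace tokenization, and `make_ngrams` is a hypothetical helper, not part of TextBlob or NLTK.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so incomplete windows are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Now is better than never".split()
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```

A list of k tokens yields k - n + 1 n-grams, which is why five words give four bigrams and three trigrams.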

In [14]:
# use TextBlob
!pip install TextBlob
from textblob import TextBlob
blob = TextBlob("Now is better than never.")
blob.ngrams(n=2)



[WordList(['Now', 'is']),
 WordList(['is', 'better']),
 WordList(['better', 'than']),
 WordList(['than', 'never'])]

In [15]:
blob.ngrams(n=3)

[WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'never'])]

In [16]:
# use nltk
from nltk.util import ngrams, bigrams
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = word_tokenize(input_str)
list(ngrams(tokens, 2))

[('NLTK', 'is'),
 ('is', 'a'),
 ('a', 'leading'),
 ('leading', 'platform'),
 ('platform', 'for'),
 ('for', 'building'),
 ('building', 'Python'),
 ('Python', 'programs'),
 ('programs', 'to'),
 ('to', 'work'),
 ('work', 'with'),
 ('with', 'human'),
 ('human', 'language'),
 ('language', 'data'),
 ('data', '.')]

In [17]:
list(bigrams(tokens))

[('NLTK', 'is'),
 ('is', 'a'),
 ('a', 'leading'),
 ('leading', 'platform'),
 ('platform', 'for'),
 ('for', 'building'),
 ('building', 'Python'),
 ('Python', 'programs'),
 ('programs', 'to'),
 ('to', 'work'),
 ('work', 'with'),
 ('with', 'human'),
 ('human', 'language'),
 ('language', 'data'),
 ('data', '.')]

In [18]:
list(ngrams(tokens, 3))

[('NLTK', 'is', 'a'),
 ('is', 'a', 'leading'),
 ('a', 'leading', 'platform'),
 ('leading', 'platform', 'for'),
 ('platform', 'for', 'building'),
 ('for', 'building', 'Python'),
 ('building', 'Python', 'programs'),
 ('Python', 'programs', 'to'),
 ('programs', 'to', 'work'),
 ('to', 'work', 'with'),
 ('work', 'with', 'human'),
 ('with', 'human', 'language'),
 ('human', 'language', 'data'),
 ('language', 'data', '.')]

### Find the most common n-grams

In [19]:
from collections import Counter
from nltk import ngrams
with open('big.txt') as f:  # the file is closed once it has been read
    bigtxt = f.read()
ngram_counts = Counter(ngrams(bigtxt.split(), 3))
ngram_counts.most_common(10)

[(('one', 'of', 'the'), 332),
 (('out', 'of', 'the'), 244),
 (('of', 'the', 'United'), 235),
 (('that', 'he', 'was'), 191),
 (('the', 'United', 'States'), 184),
 (('that', 'it', 'was'), 180),
 (('and', 'in', 'the'), 174),
 (('met', 'with', 'in'), 173),
 (('up', 'to', 'the'), 159),
 (('part', 'of', 'the'), 158)]

In [20]:
with open('mbox.txt') as f:  # the file is closed once it has been read
    mbox_txt = f.read()
ngram_counts = Counter(ngrams(mbox_txt.split(), 2))
ngram_counts.most_common(10)

[(('Received:', 'from'), 12579),
 (('ESMTP', 'id'), 7188),
 (('with', 'ESMTP'), 7188),
 (('Dec', '2007'), 7063),
 (('Nov', '2007'), 6810),
 (('-0500', 'Received:'), 5843),
 (('for', '<source@collab.sakaiproject.org>;'), 5391),
 (('text/plain;', 'charset=UTF-8'), 5391),
 (('+0000', '(GMT)'), 4932),
 (('[127.0.0.1])', 'by'), 3594)]
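
Both counts above split on whitespace only, so case and attached punctuation produce distinct tokens. The sketch below uses a short made-up string in place of a file to show how a simple normalization step changes the counts; the regular expression is one possible choice, not a recommendation from the source articles.

```python
import re
from collections import Counter

# Made-up stand-in for the contents of a text file.
text = "The cat sat. the cat slept."

raw_words = text.split()                        # keeps case and punctuation
clean_words = re.findall(r"[a-z]+", text.lower())  # lowercase letters only

raw_counts = Counter(zip(raw_words, raw_words[1:]))
clean_counts = Counter(zip(clean_words, clean_words[1:]))

print(raw_counts.most_common(1))
print(clean_counts.most_common(1))
```

Without normalization, `The cat` and `the cat` are counted as different bigrams; after it they are merged into one.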

### WordNet
WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. We'll begin by looking at synonyms and how they are accessed in WordNet.
Reference: https://www.nltk.org/book/ch02.html and http://www.nltk.org/howto/wordnet.html

In [21]:
from nltk.corpus import wordnet as wn
wn.synsets('motorcar') 

[Synset('car.n.01')]

In [22]:
# The entity car.n.01 is called a synset, or "synonym set", a collection of synonymous words (or "lemmas")
wn.synset('car.n.01').lemma_names()

[u'car', u'auto', u'automobile', u'machine', u'motorcar']

In [23]:
wn.synset('car.n.01').definition()

u'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [24]:
wn.synset('car.n.01').examples()

[u'he needs a car to get to work']

In [25]:
motorcar = wn.synset('car.n.01')
# hypernyms - parent synsets with broader meaning
motorcar_hypernyms = motorcar.hypernyms()
print(motorcar_hypernyms)

[Synset('motor_vehicle.n.01')]


In [26]:
# from an item to its components (meronyms) 
wn.synset('tree.n.01').part_meronyms() 

[Synset('burl.n.02'),
 Synset('crown.n.07'),
 Synset('limb.n.02'),
 Synset('stump.n.01'),
 Synset('trunk.n.01')]

In [27]:
# antonymy is a relation between lemmas rather than synsets
wn.lemma('supply.n.02.supply').antonyms()

[Lemma('demand.n.02.demand')]
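
Relations such as hypernymy link one entry to the next, so following them repeatedly climbs the hierarchy. The sketch below illustrates that idea with a hand-written toy taxonomy; the entries are assumptions made for the example and are not taken from WordNet.

```python
# Hand-written toy taxonomy: each term points to its broader term (hypernym).
TOY_HYPERNYM = {
    "car": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "artifact",
}

def hypernym_chain(term, parents=TOY_HYPERNYM):
    """Follow parent links until a term with no recorded parent is reached."""
    chain = [term]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

print(hypernym_chain("car"))
```

In WordNet a synset can have more than one hypernym, so the real structure is a graph rather than the single chain shown here.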

## 3. Exercise - Text data preprocessing with a sample text

### Sample Data

In [28]:
sample_doc = """I cannot say I like the movie. The acting is really lousy, don't you think so?

My favorite movie franchises, in order are: (1) Indiana Jones; (2) Marvel Cinematic Universe; (3) Star Wars; (4) Back to the Future.

This is a great movie by Tom Cruise."""


### Sentence Segmentation

In [29]:
import nltk
from nltk.tokenize import sent_tokenize

sent_tokenize_list = sent_tokenize(sample_doc)

print("The length of sentences", " = ", len(sent_tokenize_list))
print(sent_tokenize_list)

('The length of sentences', ' = ', 4)
['I cannot say I like the movie.', "The acting is really lousy, don't you think so?", 'My favorite movie franchises, in order are: (1) Indiana Jones; (2) Marvel Cinematic Universe; (3) Star Wars; (4) Back to the Future.', 'This is a great movie by Tom Cruise.']


### Tokenization

In [30]:
from nltk.tokenize import word_tokenize # NLTK default tokenizer
words = word_tokenize(sample_doc) # Should we tokenize at the document or the sentence level?
print(words)

['I', 'can', 'not', 'say', 'I', 'like', 'the', 'movie', '.', 'The', 'acting', 'is', 'really', 'lousy', ',', 'do', "n't", 'you', 'think', 'so', '?', 'My', 'favorite', 'movie', 'franchises', ',', 'in', 'order', 'are', ':', '(', '1', ')', 'Indiana', 'Jones', ';', '(', '2', ')', 'Marvel', 'Cinematic', 'Universe', ';', '(', '3', ')', 'Star', 'Wars', ';', '(', '4', ')', 'Back', 'to', 'the', 'Future', '.', 'This', 'is', 'a', 'great', 'movie', 'by', 'Tom', 'Cruise', '.']


### Normalization

In [31]:
!pip install inflect 

import re
import inflect # usage - convert integers to text
from nltk.corpus import stopwords

print(words)
print()

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word) 
        # \w: an alphanumeric character; \s: a whitespace character
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    stop_words = set(stopwords.words('english'))  # build the set once, not once per word
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words

def normalize(words):
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    return words

words = normalize(words)

print(words)

['I', 'can', 'not', 'say', 'I', 'like', 'the', 'movie', '.', 'The', 'acting', 'is', 'really', 'lousy', ',', 'do', "n't", 'you', 'think', 'so', '?', 'My', 'favorite', 'movie', 'franchises', ',', 'in', 'order', 'are', ':', '(', '1', ')', 'Indiana', 'Jones', ';', '(', '2', ')', 'Marvel', 'Cinematic', 'Universe', ';', '(', '3', ')', 'Star', 'Wars', ';', '(', '4', ')', 'Back', 'to', 'the', 'Future', '.', 'This', 'is', 'a', 'great', 'movie', 'by', 'Tom', 'Cruise', '.']
()
['say', 'like', 'movie', 'acting', 'really', 'lousy', 'nt', 'think', 'favorite', 'movie', 'franchises', 'order', u'one', 'indiana', 'jones', u'two', 'marvel', 'cinematic',
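
The token `nt` in the output comes from `remove_punctuation` stripping the apostrophe out of the contraction token `n't`. One alternative, sketched below under the assumption that contraction tokens should stay intact, is to drop only tokens that consist entirely of punctuation. `drop_punctuation_tokens` is a hypothetical helper, not part of the lab code.

```python
def drop_punctuation_tokens(tokens):
    """Keep tokens containing at least one letter or digit, leaving them unmodified."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

tokens = ["do", "n't", "you", "think", "so", "?", "(", "1", ")"]
print(drop_punctuation_tokens(tokens))
```

Whether `n't` should then be kept, expanded to `not`, or removed as a stop word depends on the task; for sentiment analysis, for example, negation usually matters.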

### Stemming and lemmatization functions

In [32]:
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

stems, lemmas = stem_and_lemmatize(words)

print('Stemmed:\n', stems)
print('\nLemmatized:\n', lemmas)

('Stemmed:\n', ['say', 'lik', u'movy', 'act', 'real', 'lousy', 'nt', 'think', 'favorit', u'movy', 'franch', 'ord', u'on', 'indian', 'jon', u'two', 'marvel', 'cinem', 'univers', u'three', 'star', 'war', u'four', 'back', 'fut', 'gre', u'movy', 'tom', 'cru'])
('\nLemmatized:\n', ['say', 'like', 'movie', u'act', 'really', 'lousy', 'nt', 'think', 'favorite', 'movie', u'franchise', 'order', u'one', 'indiana', 'jones', u'two', 'marvel', 'cinematic', 'universe', u'three', 'star', u'war', u'four', 'back', 'future', 'great', 'movie', 'tom', 'cruise'])
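
`lemmatize_verbs` treats every word as a verb, which is why noun plurals such as `franchises` are handled only incidentally. A common refinement is to take the part of speech from a tagger and pass it to the lemmatizer. The sketch below shows only the mapping step, assuming Penn Treebank tags like those printed by `nltk.pos_tag` earlier; `penn_to_wordnet` is a hypothetical helper name, and the letters are the part-of-speech codes the WordNet lemmatizer accepts.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback

# Tagged pairs as they appear in the sentence_preprocess output above.
tagged = [("acting", "NN"), ("is", "VBZ"), ("really", "RB"), ("lousy", "JJ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Each resulting letter could then be passed as the `pos` argument of `lemmatizer.lemmatize(word, pos=...)` in place of the fixed `'v'`.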
