<p style="text-align:center">
PSY 394U <b>Data Analytics with Python</b>, Spring 2019


<img style="width: 700px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2019/blob/master/Images/Banner.png?raw=true" alt="title pics"/>

</p>

<p style="text-align:center; font-size:40px; margin-bottom: 30px;"><b> Text processing </b></p>

<p style="text-align:center; font-size:18px; margin-bottom: 32px;"><b>April 16, 2019</b></p>

<hr style="height:5px;border:none" />

# 1. Example text data
<hr style="height:1px;border:none" />

To get started with **NLTK** (the **Natural Language Toolkit**), we first download some of the example data that accompanies it.

`<NLTKIntro.py>`

In [1]:
import nltk

# downloading example text corpora "book"
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

This should open a downloader window.

<img style="width: 600px; padding: 0px;" src="https://github.com/sathayas/JupyterAnalyticsSpring2019/blob/master/Images/nltk_downloader.png?raw=true" alt="NLTK downloader"/>



From the downloader, please download the collection **book**. It contains a number of corpora (plural of corpus) used in the [NLTK book](http://www.nltk.org/book/). A corpus is a collection of texts. The download may take a few minutes and requires about 420 MB of disk space.

Now, let's take a look at the *Gutenberg corpus* (**`gutenberg`** under **`nltk.corpus`**). The *Gutenberg* corpus includes a number of literary works, and can be used as example data sets.

In [2]:
# loading the Gutenberg corpus
from nltk.corpus import gutenberg

We can examine the contents of this corpus with the **`fileids()`** method, which returns a list of the files that make up the corpus.

In [3]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

From this list, we will use *Emma* by Jane Austen (**`austen-emma.txt`**). We can load the raw text with the **`raw`** method of the corpus `gutenberg`.

In [4]:
# loading the raw text
emmaRawText = gutenberg.raw('austen-emma.txt')
print(emmaRawText[:300])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was t


Alternatively, we can load the text as a collection of words using the **`words`** method.

In [5]:
# loading words
emmaWords = gutenberg.words('austen-emma.txt')
print(emmaWords[:80])

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.', 'She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', 'indulgent', 'father', ';', 'and', 'had', ',', 'in']


# 2. Tokenizing sentences and words
<hr style="height:1px;border:none" />

NLTK comes with **tokenizers** that let us split text data into sentences or words. For this example, we will download a text (The Adventures of Sherlock Holmes by Arthur Conan Doyle) from the [Project Gutenberg main web site](http://www.gutenberg.org), a repository of free electronic books.

`<Tokenize.py>`

In [6]:
import nltk

# Loading The Adventures of Sherlock Holmes by Arthur Conan Doyle
# from the Project Gutenberg
from urllib import request
url = "http://www.gutenberg.org/ebooks/1661.txt.utf-8"
response = request.urlopen(url)
rawText = response.read().decode('utf8')

Here, **`rawText`** contains the entire book. Here is an excerpt.

In [7]:
print(rawText[1210:1800])



ADVENTURE I. A SCANDAL IN BOHEMIA

I.

To Sherlock Holmes she is always THE woman. I have seldom heard
him mention her under any other name. In his eyes she eclipses
and predominates the whole of her sex. It was not that he felt
any emotion akin to love for Irene Adler. All emotions, and that
one particularly, were abhorrent to his cold, precise but
admirably balanced mind. He was, I take it, the most perfect
reasoning and observing machine that the world has seen, but as a
lover he would have placed himself in a false position. He never
spoke of the softer passions, 


The sentence tokenizer, **`sent_tokenize`**, breaks up a text into sentences.

In [8]:
# breaking up the raw text into sentences
sentText = nltk.sent_tokenize(rawText)

print(sentText[14:17])

['To Sherlock Holmes she is always THE woman.', 'I have seldom heard\r\nhim mention her under any other name.', 'In his eyes she eclipses\r\nand predominates the whole of her sex.']
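The sentences returned above still carry the line breaks (`\r\n`) of the source file. If cleaner sentences are preferred, a minimal sketch such as the following collapses every run of whitespace into a single space. The helper name `normalize_whitespace` is hypothetical (it is not part of NLTK), and the two sample sentences are copied from the output above.

```python
def normalize_whitespace(sentence):
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return ' '.join(sentence.split())

# two sentences copied from the tokenizer output above
sentences = ['I have seldom heard\r\nhim mention her under any other name.',
             'In his eyes she eclipses\r\nand predominates the whole of her sex.']

cleaned = [normalize_whitespace(s) for s in sentences]
for s in cleaned:
    print(s)
```

The word tokenizer ignores whitespace anyway, so this step mainly matters when sentences are printed or stored.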


The word tokenizer, **`word_tokenize`**, breaks up a text into words. Note that punctuation marks are returned as separate tokens.

In [9]:
# breaking up sentences into words
print(nltk.word_tokenize(sentText[14]))
print(nltk.word_tokenize(sentText[15]))
print(nltk.word_tokenize(sentText[16]))

['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'THE', 'woman', '.']
['I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.']
['In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', '.']


In [10]:
# breaking up the raw text into words
wordText = nltk.word_tokenize(rawText)

In [11]:
wordText[220:235]

['To',
 'Sherlock',
 'Holmes',
 'she',
 'is',
 'always',
 'THE',
 'woman',
 '.',
 'I',
 'have',
 'seldom',
 'heard',
 'him',
 'mention']

## Word frequency

Now that the entire text is tokenized into words, we can determine word frequencies with **`FreqDist`**.

In [12]:
# word frequency
wordDist = nltk.FreqDist(wordText)

Here, the resulting **`wordDist`** is a dictionary-like object. 

In [13]:
wordDist

FreqDist({',': 7779, '.': 5867, 'the': 5420, 'I': 3034, 'and': 2871, 'of': 2733, 'to': 2729, '``': 2723, 'a': 2595, "''": 2392, ...})

You can get a list of most frequent words with the **`most_common`** method.

In [14]:
print(wordDist.most_common(30))

[(',', 7779), ('.', 5867), ('the', 5420), ('I', 3034), ('and', 2871), ('of', 2733), ('to', 2729), ('``', 2723), ('a', 2595), ("''", 2392), ('in', 1744), ('that', 1662), ('was', 1395), ('it', 1302), ('you', 1271), ('he', 1167), ('is', 1134), ('his', 1102), ('have', 907), ('my', 906), ('with', 849), ('had', 824), ('as', 780), ('which', 770), ('at', 743), ('?', 737), ('for', 716), ('not', 686), ('be', 642), ('me', 635)]


Or you can get the frequency of a particular word. 

In [15]:
wordDist['adventure']

7

In [16]:
wordDist['deduction']

6
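In current NLTK releases, `FreqDist` is built on Python's `collections.Counter`, so the operations above can be tried out on a toy example without NLTK. The sketch below uses a made-up token list (`tokens` is an assumption for illustration, not data from the book).

```python
from collections import Counter

# a made-up token list, standing in for the output of a word tokenizer
tokens = ['the', 'dog', 'saw', 'the', 'cat', '.', 'the', 'cat', 'ran', '.']
toyDist = Counter(tokens)

print(toyDist.most_common(1))   # the single most frequent token and its count
print(toyDist['cat'])           # frequency of a particular word
print(toyDist['holmes'])        # a word that never occurs has a count of 0
```

Looking up a missing word returns 0 instead of raising a `KeyError`, which is convenient when checking words that may not appear in a text.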

## Manipulating string data

After tokenization, the text data can be handled as a collection of words. We can take advantage of various string manipulation methods and functions available in Python. For example, we can convert all words to lower case with the **`lower()`** method.

In [17]:
# word frequency after converting to lower case
wordTextLower = [w.lower() for w in wordText]
wordDistLower = nltk.FreqDist(wordTextLower)
print(wordDistLower.most_common(30))

[(',', 7779), ('.', 5867), ('the', 5793), ('and', 3061), ('i', 3034), ('of', 2777), ('to', 2761), ('``', 2723), ('a', 2697), ("''", 2392), ('in', 1818), ('that', 1757), ('it', 1736), ('you', 1536), ('he', 1484), ('was', 1413), ('his', 1158), ('is', 1148), ('my', 999), ('have', 927), ('with', 877), ('as', 861), ('had', 833), ('at', 782), ('which', 776), ('for', 749), ('?', 737), ('not', 709), ('but', 648), ('be', 646)]


We can generate a list of the unique words in a word list with the **`set`** function. From that list, we can then extract words satisfying certain conditions.

In [18]:
# just long words (10 characters or more)
wordSetLower = sorted(set(wordTextLower))  # unique word list
longWords = [w for w in wordSetLower if len(w)>9]
print(longWords[:10])

["'breckinridge", "'certainly", "'encyclopaedia", "'gesellschaft", "'hampshire", "'photography", "'pondicherry", "'precisely", "'remarkable", "'undoubtedly"]
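Several of the words above begin with an apostrophe, which appears to be an opening quotation mark that the tokenizer left attached to the word. It also adds one character to the word's length. As a hedged sketch (one possible clean-up, not necessarily the best one), surrounding apostrophes can be stripped before the unique word list is built. The `sample` list mixes tokens from the output above with made-up ones.

```python
# tokens with and without a leading apostrophe
sample = ["'breckinridge", "'certainly", "absolutely", "'precisely", "certainly"]

# strip apostrophes from both ends, then build the sorted unique word list
stripped = sorted(set(w.strip("'") for w in sample))
longWords = [w for w in stripped if len(w) > 9]

print(stripped)
print(longWords)
```

After stripping, `certainly` and `precisely` (9 letters each) no longer pass the length filter. Be aware that `strip("'")` would also turn a token such as `'s` into `s`.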


In [19]:
# long words appearing more than 20 times
longFreqWords = [w for w in wordSetLower
                 if (len(w)>9) and wordDistLower[w]>20]
print(longFreqWords)

['absolutely', 'considerable', 'electronic', 'foundation', 'gutenberg-tm', 'interesting', 'photograph', 'stepfather', 'understand']


### Exercise
1. **Long, frequent words, Emma**. Generate a list of long words (10 characters or more) appearing frequently (more than 20 times) in Emma.

# 3. Part-of-speech tags
<hr style="height:1px;border:none" />

**Part-of-speech** (**POS**) tags indicate categories of words with similar grammatical properties (e.g., verbs, nouns, adjectives). In NLTK, the **`pos_tag`** function assigns a POS tag to each word in a list of tokens. As an example, we focus on a sentence from Emma.

`<POS-Tag.py>`

In [20]:
import nltk

# loading the Emma by Jane Austen
from nltk.corpus import gutenberg
emmaRawText = gutenberg.raw('austen-emma.txt')

# tokenizing
emmaSents = nltk.sent_tokenize(emmaRawText)
print(emmaSents[5])

Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.


We break this up into words via `word_tokenize`.

In [21]:
emmaWords = nltk.word_tokenize(emmaSents[5])

In [22]:
emmaWords

['Even',
 'before',
 'Miss',
 'Taylor',
 'had',
 'ceased',
 'to',
 'hold',
 'the',
 'nominal',
 'office',
 'of',
 'governess',
 ',',
 'the',
 'mildness',
 'of',
 'her',
 'temper',
 'had',
 'hardly',
 'allowed',
 'her',
 'to',
 'impose',
 'any',
 'restraint',
 ';',
 'and',
 'the',
 'shadow',
 'of',
 'authority',
 'being',
 'now',
 'long',
 'passed',
 'away',
 ',',
 'they',
 'had',
 'been',
 'living',
 'together',
 'as',
 'friend',
 'and',
 'friend',
 'very',
 'mutually',
 'attached',
 ',',
 'and',
 'Emma',
 'doing',
 'just',
 'what',
 'she',
 'liked',
 ';',
 'highly',
 'esteeming',
 'Miss',
 'Taylor',
 "'s",
 'judgment',
 ',',
 'but',
 'directed',
 'chiefly',
 'by',
 'her',
 'own',
 '.']

Now, we assign POS tags to this sentence.

In [23]:
# POS tagging of an example sentence
emmaTagged = nltk.pos_tag(emmaWords)
print(emmaTagged)

[('Even', 'RB'), ('before', 'IN'), ('Miss', 'NNP'), ('Taylor', 'NNP'), ('had', 'VBD'), ('ceased', 'VBN'), ('to', 'TO'), ('hold', 'VB'), ('the', 'DT'), ('nominal', 'JJ'), ('office', 'NN'), ('of', 'IN'), ('governess', 'NN'), (',', ','), ('the', 'DT'), ('mildness', 'NN'), ('of', 'IN'), ('her', 'PRP$'), ('temper', 'NN'), ('had', 'VBD'), ('hardly', 'RB'), ('allowed', 'VBN'), ('her', 'PRP'), ('to', 'TO'), ('impose', 'VB'), ('any', 'DT'), ('restraint', 'NN'), (';', ':'), ('and', 'CC'), ('the', 'DT'), ('shadow', 'NN'), ('of', 'IN'), ('authority', 'NN'), ('being', 'VBG'), ('now', 'RB'), ('long', 'RB'), ('passed', 'VBN'), ('away', 'RB'), (',', ','), ('they', 'PRP'), ('had', 'VBD'), ('been', 'VBN'), ('living', 'VBG'), ('together', 'RB'), ('as', 'IN'), ('friend', 'NN'), ('and', 'CC'), ('friend', 'VB'), ('very', 'RB'), ('mutually', 'RB'), ('attached', 'VBN'), (',', ','), ('and', 'CC'), ('Emma', 'NNP'), ('doing', 'VBG'), ('just', 'RB'), ('what', 'WP'), ('she', 'PRP'), ('liked', 'VBD'), (';', ':'), (

Each word is represented as a tuple: a pair of the word and its POS tag. You may ask, "What are those symbols, such as RB or IN?" You can see the list of tags with:

In [24]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

With these tags, we can easily extract a certain category of words. For example, the tags for verbs start with the characters **`"VB"`**, so we can extract only the verbs from the text data by:

In [25]:
# extracting verbs only (starting with VB)
emmaVerbs = [w for w in emmaTagged if w[1].startswith('VB')]
print(emmaVerbs)

[('had', 'VBD'), ('ceased', 'VBN'), ('hold', 'VB'), ('had', 'VBD'), ('allowed', 'VBN'), ('impose', 'VB'), ('being', 'VBG'), ('passed', 'VBN'), ('had', 'VBD'), ('been', 'VBN'), ('living', 'VBG'), ('friend', 'VB'), ('attached', 'VBN'), ('doing', 'VBG'), ('liked', 'VBD'), ('esteeming', 'VBG'), ('directed', 'VBD')]


Or adverbs (tags starting with **`"RB"`**):

In [26]:
# extracting adverbs only
emmaAdv = [w for w in emmaTagged if w[1].startswith('RB')]
print(emmaAdv)

[('Even', 'RB'), ('hardly', 'RB'), ('now', 'RB'), ('long', 'RB'), ('away', 'RB'), ('together', 'RB'), ('very', 'RB'), ('mutually', 'RB'), ('just', 'RB'), ('highly', 'RB')]


Or proper nouns (tags starting with **`"NNP"`**):

In [27]:
# extracting proper nouns only
emmaNNP = [w for w in emmaTagged if w[1].startswith('NNP')]
print(emmaNNP)

[('Miss', 'NNP'), ('Taylor', 'NNP'), ('Emma', 'NNP'), ('Miss', 'NNP'), ('Taylor', 'NNP')]
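When selecting words by tag prefix, a substring test (`'RB' in tag`) and a prefix test (`tag.startswith('RB')`) do not always agree. The Penn Treebank tag set includes `WRB` (wh-adverbs such as "when"), which contains `RB` but does not start with it. The sketch below shows the difference on a small hand-made tagged list. The list is an assumption for illustration, not output from `pos_tag`.

```python
# a small hand-made tagged list (not produced by pos_tag)
tagged = [('when', 'WRB'), ('hardly', 'RB'), ('better', 'RBR'), ('friend', 'NN')]

# substring test: also matches 'WRB'
by_substring = [w for w, tag in tagged if 'RB' in tag]

# prefix test: matches only tags that start with 'RB'
by_prefix = [w for w, tag in tagged if tag.startswith('RB')]

print(by_substring)
print(by_prefix)
```

Which behavior is wanted depends on whether wh-adverbs should count as adverbs in the analysis.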


### Exercise
1. **POS frequencies**. In the POS-tagged sentence **`emmaTagged`**, count the number of
  * Verbs: tags starting with **VB**
  * Nouns: tags starting with **NN**
  * Adjectives: tags starting with **JJ**
  * Adverbs: tags starting with **RB**

# 4. Stop words and punctuation marks
<hr style="height:1px;border:none" />

In the word frequencies from an earlier example, you may have noticed that the most frequent words are actually punctuation marks (e.g., commas (,), periods (.), question marks (?)) and a class of words known as **stop words** (e.g., "the", "and", "of", "to"). For many analyses, such as word frequency counts, punctuation marks and stop words carry little information, so we can remove them.

## Removing punctuation marks

We can remove punctuation marks simply by using the **`isalpha()`** method for string data. We demonstrate this on the example sentence from Emma we saw earlier.

`<StopwordsPunct.py>`

In [28]:
import nltk
from nltk.corpus import stopwords

# loading the Emma by Jane Austen
from nltk.corpus import gutenberg
rawText = gutenberg.raw('austen-emma.txt')

# tokenizing
sentText = nltk.sent_tokenize(rawText)
print(sentText[5])
wordText = nltk.word_tokenize(sentText[5])

Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.


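We can examine whether each token consists only of alphabetic characters. Any token containing a non-alphabetic character is eliminated. Note that this also drops tokens such as `'s`, numbers, and hyphenated words, not just punctuation marks.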

In [29]:
# removing punctuation marks, making all words lower case
wordDePunct = [w.lower() for w in wordText if w.isalpha()]

In [30]:
print(wordDePunct)

['even', 'before', 'miss', 'taylor', 'had', 'ceased', 'to', 'hold', 'the', 'nominal', 'office', 'of', 'governess', 'the', 'mildness', 'of', 'her', 'temper', 'had', 'hardly', 'allowed', 'her', 'to', 'impose', 'any', 'restraint', 'and', 'the', 'shadow', 'of', 'authority', 'being', 'now', 'long', 'passed', 'away', 'they', 'had', 'been', 'living', 'together', 'as', 'friend', 'and', 'friend', 'very', 'mutually', 'attached', 'and', 'emma', 'doing', 'just', 'what', 'she', 'liked', 'highly', 'esteeming', 'miss', 'taylor', 'judgment', 'but', 'directed', 'chiefly', 'by', 'her', 'own']
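To see exactly what `isalpha()` keeps and drops, the following sketch applies it to a few tokens of the kind that appear in this notebook (the `tokens` list is assembled by hand for illustration).

```python
# hand-picked tokens: words, a possessive marker, a number,
# a hyphenated word, and punctuation marks
tokens = ['Taylor', "'s", 'twenty', '-', 'one', '1816', 'gutenberg-tm', ';']

for t in tokens:
    print('%-15s %s' % (t, t.isalpha()))

kept = [t for t in tokens if t.isalpha()]
print(kept)
```

Only tokens made up entirely of letters survive. This explains why `'s` is missing from the cleaned sentence above.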


## Removing stop words

Now, to remove stop words, we first have to import the corpus of stop words **`stopwords`**, and select that of English.

In [31]:
stop_words = set(stopwords.words('english'))  # stop words in English

In [32]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

Then we examine whether each word is part of the set of stop words (**`stop_words`**). If a word is a stop word, that word is eliminated.

In [33]:
wordNoStopwd = [w for w in wordDePunct if w not in stop_words]
print(wordNoStopwd)

['even', 'miss', 'taylor', 'ceased', 'hold', 'nominal', 'office', 'governess', 'mildness', 'temper', 'hardly', 'allowed', 'impose', 'restraint', 'shadow', 'authority', 'long', 'passed', 'away', 'living', 'together', 'friend', 'friend', 'mutually', 'attached', 'emma', 'liked', 'highly', 'esteeming', 'miss', 'taylor', 'judgment', 'directed', 'chiefly']


## Word frequency, before and after cleaning

We shall see how removing punctuation marks and stop words alters the word frequency using the Sherlock Holmes example.

`<StopwordsFreq.py>`

In [34]:
import nltk
from nltk.corpus import stopwords

# Loading The Adventures of Sherlock Holmes by Arthur Conan Doyle
# from the Project Gutenberg
from urllib import request
url = "http://www.gutenberg.org/ebooks/1661.txt.utf-8"
response = request.urlopen(url)
rawText = response.read().decode('utf8')

# tokenizing
wordText = nltk.word_tokenize(rawText)

Here is a list of the 30 most frequent words before the removal of punctuation marks and stop words.

In [35]:
# word frequency before removing punctuations and stop words
print('Before text processing')
wordFreqBefore = nltk.FreqDist(wordText)
for iWord in wordFreqBefore.most_common(30):
    print('%-15s\t%6d' % iWord)

Before text processing
,              	  7779
.              	  5867
the            	  5420
I              	  3034
and            	  2871
of             	  2733
to             	  2729
``             	  2723
a              	  2595
''             	  2392
in             	  1744
that           	  1662
was            	  1395
it             	  1302
you            	  1271
he             	  1167
is             	  1134
his            	  1102
have           	   907
my             	   906
with           	   849
had            	   824
as             	   780
which          	   770
at             	   743
?              	   737
for            	   716
not            	   686
be             	   642
me             	   635
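The format string `'%-15s\t%6d'` used in the loop above left-justifies the word in a field of 15 characters and right-justifies the count in a field of 6 characters, separated by a tab. As a sketch, the same layout can be written with an f-string. The `pair` below is copied from the output above.

```python
pair = ('the', 5420)          # one (word, count) tuple, as returned by most_common

# printf-style formatting, as used in the notebook
old_style = '%-15s\t%6d' % pair

# equivalent f-string: '<15' left-justifies, '6d' right-justifies an integer
word, count = pair
new_style = f'{word:<15}\t{count:6d}'

print(old_style)
print(old_style == new_style)
```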


And here is a list of the 30 most frequent words after removing punctuation marks and stop words.

In [36]:
# removing punctuations and stopwords
wordDePunct = [w.lower() for w in wordText if w.isalpha()]
stop_words = set(stopwords.words('english'))  # stop words in English
wordNoStopwd = [w for w in wordDePunct if w not in stop_words]


# word frequency after removing punctuations and stop words
print('After text processing')
wordFreqAfter = nltk.FreqDist(wordNoStopwd)
for iWord in wordFreqAfter.most_common(30):
    print('%-15s\t%6d' % iWord)

After text processing
said           	   486
upon           	   467
holmes         	   466
one            	   374
would          	   333
man            	   303
could          	   288
little         	   269
see            	   232
may            	   210
us             	   184
well           	   176
think          	   174
must           	   171
know           	   171
shall          	   171
come           	   160
time           	   151
came           	   146
two            	   143
door           	   141
back           	   139
room           	   134
face           	   128
might          	   126
matter         	   125
much           	   121
way            	   116
yes            	   114
heard          	   113


# 5. Stemming and lemmatizing
<hr style="height:1px;border:none" />

Consider the word "spam." It can be a noun in the singular or plural form ("spam" or "spams"). It can be a verb, and thus be conjugated depending on the context (e.g., "spamming", "spammed"). Or a new word can be created by adding a suffix (e.g., "spammer", "spamize", "spamly"). Semantically, these are the same word or closely related words. However, because they are spelled differently, a program counts them as distinct words. Two ways to get around this problem are **stemming** and **lemmatizing**.

## Stemming

**Stemming** strips suffixes from a word by rule to obtain its stem, which need not be a real word. Here is a simple example.

`<Stemming.py>`

In [37]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# sample words
sampleWords = ['spam', 'spams', 'spamming', 'spammed', 'spammer', 'spammers',
               'spamize','spamly']

To perform stemming, we need to use the **`PorterStemmer`** transformation object under **`nltk.stem`**. We first need to define a stemming transformation object.

In [38]:
# stemmer object
ps = PorterStemmer()

Once the object **`ps`** is defined, then we can use the **`stem`** method to determine the stem of a word. In our **`sampleWords`**, after stemming,

In [39]:
for w in sampleWords:
    print(ps.stem(w))

spam
spam
spam
spam
spammer
spammer
spamiz
spamli


You may notice that there are some common stems (**`spam`** and **`spammer`**). However, the other stems do not resemble real words (**`spamiz`** and **`spamli`**).

Just for fun, we can apply stemming to our example sentence from Emma.

In [40]:
# loading the Emma by Jane Austen
from nltk.corpus import gutenberg
rawText = gutenberg.raw('austen-emma.txt')

# tokenizing
sentText = nltk.sent_tokenize(rawText)
wordText = nltk.word_tokenize(sentText[5])

# removing punctuation marks & stop words, making all words lower case
wordDePunct = [w.lower() for w in wordText if w.isalpha()]
stop_words = set(stopwords.words('english'))  # stop words in English
wordNoStopwd = [w for w in wordDePunct if w not in stop_words]

Here are the words before stemming:

In [41]:
# before stemming
print(wordNoStopwd)

['even', 'miss', 'taylor', 'ceased', 'hold', 'nominal', 'office', 'governess', 'mildness', 'temper', 'hardly', 'allowed', 'impose', 'restraint', 'shadow', 'authority', 'long', 'passed', 'away', 'living', 'together', 'friend', 'friend', 'mutually', 'attached', 'emma', 'liked', 'highly', 'esteeming', 'miss', 'taylor', 'judgment', 'directed', 'chiefly']


And here are the words after stemming:

In [42]:
# after stemming
wordStem = [ps.stem(w) for w in wordNoStopwd]
print(wordStem)

['even', 'miss', 'taylor', 'ceas', 'hold', 'nomin', 'offic', 'gover', 'mild', 'temper', 'hardli', 'allow', 'impos', 'restraint', 'shadow', 'author', 'long', 'pass', 'away', 'live', 'togeth', 'friend', 'friend', 'mutual', 'attach', 'emma', 'like', 'highli', 'esteem', 'miss', 'taylor', 'judgment', 'direct', 'chiefli']


As you can see, it works in some cases (e.g., `allowed` $\rightarrow$ `allow` or `mutually` $\rightarrow$ `mutual`). But some results are non-words (e.g., `ceased` $\rightarrow$ `ceas` or `together` $\rightarrow$ `togeth`). 
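To get a feel for why rule-based suffix stripping produces non-words, here is a deliberately crude toy stemmer. It is **not** the Porter algorithm, which applies a much longer sequence of carefully ordered rules. The function `toy_stem` and its suffix list are made up for illustration only.

```python
def toy_stem(word, suffixes=('ing', 'ed', 'ly', 's')):
    """Strip the first matching suffix, keeping at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['allowed', 'ceased', 'living', 'spams', 'friend']:
    print(w, '->', toy_stem(w))
```

The rule "remove `ed`" is right for `allowed` but leaves `ceas` for `ceased`, because the stemmer has no knowledge of which strings are actual words.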

## Lemmatizing

Unlike stemming, **lemmatizing** maps a word to its dictionary form (its *lemma*), which is always a real word. For example,
  * `cats` $\rightarrow$ `cat`
  * `cacti` $\rightarrow$ `cactus`
  * `geese` $\rightarrow$ `goose`

In NLTK, the transformation object **`WordNetLemmatizer`** under **`nltk.stem`** implements lemmatization. Here are some examples.

`<Lemmatizing.py>`

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# sample words
sampleWords = ['cats','cacti','geese','rocks','oxen','ran','spamming',
               'spammed','spammer','moves','movement','better']
for w in sampleWords:
    print(w)

In [None]:
# lemmatizer object
lmt = WordNetLemmatizer()

# lemmatized words
for w in sampleWords:
    print(lmt.lemmatize(w))

By default, the lemmatizer assumes that the input word is a noun. If the word is not a noun, you can specify that with the **`pos`** parameter (**`'a'`** for adjectives, **`'v'`** for verbs, and **`'r'`** for adverbs). 

In [None]:
# some non-noun words
print(lmt.lemmatize('ran', pos='v'))

In [None]:
print(lmt.lemmatize('better', pos='a'))

Now let's go back to our sample sentence. 

In [None]:
# loading the Emma by Jane Austen
from nltk.corpus import gutenberg
rawText = gutenberg.raw('austen-emma.txt')

# tokenizing
sentText = nltk.sent_tokenize(rawText)
wordText = nltk.word_tokenize(sentText[5])

# removing punctuation marks & stop words, making all words lower case
wordDePunct = [w.lower() for w in wordText if w.isalpha()]
stop_words = set(stopwords.words('english'))  # stop words in English
wordNoStopwd = [w for w in wordDePunct if w not in stop_words]

Here are the words before lemmatizing.

In [None]:
# before lemmatizing
print(wordNoStopwd)

And after lemmatizing.

In [None]:
# after lemmatizing
wordLemma = [lmt.lemmatize(w) for w in wordNoStopwd]
print(wordLemma)

When you run this, you will see that verbs are still not lemmatized, because the lemmatizer treats every word as a noun by default. So we use the POS tags to lemmatize each word according to its type.

In [None]:
# using POS tags
wordPOS = nltk.pos_tag(wordText)
# removing punctuation marks & stop words, making all words lower case
wordPOSDePunct = [(w[0].lower(), w[1]) for w in wordPOS if w[0].isalpha()]
wordPOSNoStopwd = [w for w in wordPOSDePunct if w[0] not in stop_words]
# initializing the lemmatized word list
wordPOSLemma = []
for wPair in wordPOSNoStopwd:
    if wPair[1][0] == 'J':   # i.e., adjectives
        wordPOSLemma.append(lmt.lemmatize(wPair[0],pos='a'))
    elif wPair[1][0] == 'V':  # i.e., verbs
        wordPOSLemma.append(lmt.lemmatize(wPair[0],pos='v'))
    elif wPair[1].startswith('RB'):  # i.e., adverbs
        wordPOSLemma.append(lmt.lemmatize(wPair[0],pos='r'))
    else:
        wordPOSLemma.append(lmt.lemmatize(wPair[0]))
print(wordPOSLemma)

Much better!
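The tag-to-`pos` logic in the loop above can also be factored into a small helper, which keeps the lemmatizing step to a single line. This is only a sketch: `penn_to_wordnet` is a hypothetical name, and the returned letters are the `pos` values described earlier (`'a'`, `'v'`, `'r'`, with `'n'` for nouns as the fallback).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the lemmatizer's pos letter."""
    if tag.startswith('J'):    # adjectives (JJ, JJR, JJS)
        return 'a'
    if tag.startswith('V'):    # verbs (VB, VBD, VBG, ...)
        return 'v'
    if tag.startswith('RB'):   # adverbs (RB, RBR, RBS)
        return 'r'
    return 'n'                 # everything else is treated as a noun

# tags taken from the tagged Emma sentence shown earlier
for tag in ['JJ', 'VBD', 'RB', 'NNP', 'IN']:
    print(tag, '->', penn_to_wordnet(tag))

# With the objects defined in this notebook, the loop could then become:
# wordPOSLemma = [lmt.lemmatize(w, pos=penn_to_wordnet(t))
#                 for w, t in wordPOSNoStopwd]
```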

## Lemmatizing and word frequency

We can apply lemmatization to a larger text data set. Let's examine the Sherlock Holmes example, and see how lemmatization affects word frequency counts.

`<LemmatizingFreq.py>`

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


# Loading The Adventures of Sherlock Holmes by Arthur Conan Doyle
# from the Project Gutenberg
from urllib import request
url = "http://www.gutenberg.org/ebooks/1661.txt.utf-8"
response = request.urlopen(url)
rawText = response.read().decode('utf8')


# tokenizing
wordText = nltk.word_tokenize(rawText)

Here are the top 30 most frequent words, before lemmatization and other fancy text processing (removing punctuation marks and stop words, and turning all letters lower case).

In [None]:
# word frequency before doing fancy text processing stuff
print('Before text processing')
wordFreqBefore = nltk.FreqDist(wordText)
for iWord in wordFreqBefore.most_common(30):
    print('%-15s\t%6d' % iWord)

Now some fancy text processing and lemmatizing.

In [None]:
# removing punctuation marks & stop words, making all words lower case, 
wordDePunct = [w.lower() for w in wordText if w.isalpha()]
stop_words = set(stopwords.words('english'))  # stop words in English
wordNoStopwd = [w for w in wordDePunct if w not in stop_words]

# Lemmatizing using POS tags
lmt = WordNetLemmatizer()
wordPOS = nltk.pos_tag(wordText)
# removing punctuation marks & stop words, making all words lower case
wordPOSDePunct = [(w[0].lower(), w[1]) for w in wordPOS if w[0].isalpha()]
wordPOSNoStopwd = [w for w in wordPOSDePunct if w[0] not in stop_words]
# initializing the lemmatized word list
wordPOSLemma = []
for wPair in wordPOSNoStopwd:
    if wPair[1][0] == 'J':   # i.e., adjectives
        wordPOSLemma.append(lmt.lemmatize(wPair[0],pos='a'))
    elif wPair[1][0] == 'V':  # i.e., verbs
        wordPOSLemma.append(lmt.lemmatize(wPair[0],pos='v'))
    elif wPair[1].startswith('RB'):  # i.e., adverbs
        wordPOSLemma.append(lmt.lemmatize(wPair[0],pos='r'))
    else:
        wordPOSLemma.append(lmt.lemmatize(wPair[0]))


We shall see how these steps affect the word frequency count. Here are the top 30 words on the processed and lemmatized text.

In [None]:
# word frequency after fancy text processing stuff
print('After text processing')
wordFreqAfter = nltk.FreqDist(wordPOSLemma)
for iWord in wordFreqAfter.most_common(30):
    print('%-15s\t%6d' % iWord)

### Exercise
1. **Backward lemmatized word frequency**. Words that can be lemmatized to **`think`** include:
```
think, Think, thinks, Thinks, thought, Thought, thinking, Thinking
```
  Determine the frequency counts of these words in the Sherlock Holmes data set before text processing (i.e., in **`wordFreqBefore`**). What is the sum of the frequencies?

2. **Verb or noun?** Among the words listed in the previous exercise, **`thought`** is lemmatized to **`think`** if it is a verb, but to **`thought`** if it is a noun. Determine how many occurrences of **`thought`** are verbs and how many are nouns.