# Stemming

- Stemming is the process of stripping affixes from words. For example:
  - input: Natural language processing in python
  - output: Natur Languag process in python

- Text is more commonly normalized by converting all words to lowercase, so that The and the are treated as the same word.

- With stemming, the words playing, played and play are all treated as a single word, i.e. play.
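
The two normalization steps described above can be sketched in plain Python. This is a toy illustration only: `toy_stem` and its three-suffix rule set are hypothetical and much cruder than the rules used by real stemmers such as Porter or Lancaster.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Lowercase a word and strip the first matching suffix (toy rule set)."""
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words such as "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical sample words, chosen to match the examples in the text.
words = ["Playing", "played", "play", "The", "the"]
stems = [toy_stem(w) for w in words]

print(stems)            # ['play', 'play', 'play', 'the', 'the']
print(len(set(stems)))  # 2
```

Five distinct surface forms collapse to two normalized forms, which is the effect the rest of this section measures on a real corpus.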

# Stemmers in nltk

- nltk comes with a few stemmers.

- The two most widely used stemmers are the Porter and Lancaster stemmers.

- Each stemmer has its own rules for stripping affixes.

- The following example demonstrates stemming the word builders using PorterStemmer.

In [7]:
import nltk
from nltk import PorterStemmer,LancasterStemmer

In [8]:
porter = nltk.PorterStemmer()

In [9]:
porter.stem('builders')

'builder'

In [10]:
porter.stem('played')

'play'

- Now let's see how to use LancasterStemmer and stem the word builders.

In [11]:
lancaster = LancasterStemmer()

In [12]:
lancaster.stem('builders')

'build'

- LancasterStemmer returns build whereas PorterStemmer returns builder: the Lancaster rules strip affixes more aggressively.
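
To compare the two stemmers side by side, a small loop is enough. This is a minimal sketch assuming nltk is installed; the word list is an arbitrary choice for illustration, and no output is claimed beyond the values shown in the cells above.

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Arbitrary sample words; replace with any words of interest.
sample_words = ["builders", "played", "playing"]

# Print each word followed by its Porter stem and its Lancaster stem.
for word in sample_words:
    print(word, porter.stem(word), lancaster.stem(word))
```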

# Normalizing with Stemming
- Let's consider the text collection, text1.

- Let's first determine the number of unique words present in the original text1.

- Then normalize the text by converting all the words into lower case and again determine the number of unique words.

In [13]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [14]:
len(set(text1))

19317

In [15]:
lc_words = [word.lower() for word in text1]

In [16]:
len(set(lc_words))

17231

- Now let's further normalize text1 with Porter Stemmer.

In [17]:
p_stem_words = [porter.stem(word) for word in set(lc_words)]

In [18]:
len(set(p_stem_words))

10927

- The above output shows that, after normalising with Porter Stemmer, the text1 collection has 10927 unique words.

In [19]:
l_stem_words = [lancaster.stem(word) for word in set(lc_words)]

In [20]:
len(set(l_stem_words))

9036

- Applying Lancaster Stemmer to the text1 collection results in 9036 unique words.
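
The repeated pattern in the cells above (normalize every token, then count the distinct results) can be wrapped in one helper. This is a sketch with a hypothetical function name, `vocabulary_size`, and a hypothetical toy token list; in the notebook, a stemmer method such as `porter.stem` could be passed as the `normalize` argument instead of `str.lower`.

```python
def vocabulary_size(tokens, normalize=None):
    """Count distinct tokens after applying an optional normalizer."""
    if normalize is None:
        return len(set(tokens))
    return len({normalize(token) for token in tokens})


# Hypothetical sample tokens, used only for illustration.
sample = ["The", "whale", "the", "Whale", "whales"]

print(vocabulary_size(sample))             # 5
print(vocabulary_size(sample, str.lower))  # 3
```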

# Understanding Lemma
- A lemma is a lexical entry in a lexical resource such as a dictionary.

- Multiple lemmas can share the same spelling. These are known as homonyms.

- For example, the two lemmas listed below are homonyms.

1. saw [verb] - Past tense of see
2. saw [noun] - Cutting instrument
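
One way to picture this is a lexicon whose entries are keyed by spelling, with one sense per part of speech. The sketch below uses a hypothetical mini-lexicon; the "play" entry is invented for contrast and is not part of any real resource.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: each entry is (spelling, part of speech, gloss).
entries = [
    ("saw", "verb", "past tense of see"),
    ("saw", "noun", "cutting instrument"),
    ("play", "verb", "engage in a game"),
]

# Group the entries by spelling.
by_spelling = defaultdict(list)
for spelling, pos, gloss in entries:
    by_spelling[spelling].append((pos, gloss))

# Spellings with more than one entry are homonyms.
homonyms = {s: senses for s, senses in by_spelling.items() if len(senses) > 1}
print(homonyms)
# {'saw': [('verb', 'past tense of see'), ('noun', 'cutting instrument')]}
```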


# Lemmatization
- nltk comes with WordNetLemmatizer. This lemmatizer removes affixes only if the resulting word is found in the lexical resource WordNet.
- WordNetLemmatizer is mainly used to build a vocabulary of words that are valid lemmas.

In [21]:
wnl = nltk.WordNetLemmatizer()

In [22]:
wnl_stem_words = [wnl.lemmatize(word) for word in set(lc_words) ]

In [23]:
len(set(wnl_stem_words))

15168
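
The idea of removing an affix only when the result is a known word can be sketched without WordNet. The function `toy_lemmatize`, its suffix list, and the three-word lexicon below are all hypothetical; this is not the algorithm WordNetLemmatizer actually uses, only an illustration of the lookup check.

```python
def toy_lemmatize(word, lexicon, suffixes=("ing", "ed", "s")):
    """Strip a suffix only when the remainder is a known lexicon entry."""
    for suffix in suffixes:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in lexicon:
                return candidate
    # No suffix produced a valid entry, so the word is left unchanged.
    return word


# Hypothetical lexicon standing in for a real lexical resource.
lexicon = {"play", "build", "builder"}

for word in ["played", "builders", "during", "news"]:
    print(word, "->", toy_lemmatize(word, lexicon))
# played -> play
# builders -> builder
# during -> during
# news -> news
```

Because words are left untouched whenever the stripped form is not a valid entry, this kind of normalization merges fewer words than a stemmer does.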

In [26]:
from nltk.corpus import brown

In [27]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [30]:
humor_words = brown.words(categories='humor')

In [31]:
lc_humor_words = [word.lower() for word in humor_words]

In [32]:
lc_humor_uniq_words = list(set(lc_humor_words))

In [33]:
lc_humor_uniq_words

['profound',
 'courtiers',
 'cup',
 'victory',
 'frozen',
 'feeley',
 'oscar',
 'secured',
 'change',
 'dictum',
 'blind',
 'apartment',
 'perle',
 'escaped',
 'tells',
 "wife's",
 'winked',
 'charming',
 "clergyman's",
 'contributes',
 'silent',
 'excess',
 'sung',
 'forecast',
 'lists',
 'secretly',
 'curio',
 'commoners',
 'uniformed',
 'jeunes',
 'culmination',
 'their',
 'violently',
 'read',
 'free',
 'salary',
 'paid',
 'hope',
 'probably',
 'my',
 'downfall',
 'touch',
 'apparent',
 'pattern',
 'admit',
 'mighty',
 'saner',
 'los',
 'himself',
 'scandal',
 'babylon',
 'comfortably',
 'metal',
 'maltreat',
 'focused',
 'wynn',
 'tries',
 'modern',
 'trump',
 'parent',
 'low',
 'presently',
 'aviary',
 'during',
 "carpenter's",
 'remarks',
 'threw',
 'roles',
 'inches',
 'understand',
 "it's",
 'comprehend',
 'lifelong',
 'ready',
 'glad',
 'relentlessness',
 'acts',
 'carvings',
 'filling',
 'gooshey',
 'jumpy',
 'stems',
 'bothering',
 'cowardice',
 'pages',
 'beaches',
 'floor

In [34]:
from nltk.corpus import words

In [41]:
wordlist_words = words.words()

In [42]:
wordlist_uniq_words = list(set(wordlist_words))

In [43]:
print(lc_humor_uniq_words)

['profound', 'courtiers', 'cup', 'victory', 'frozen', 'feeley', 'oscar', 'secured', 'change', 'dictum', 'blind', 'apartment', 'perle', 'escaped', 'tells', "wife's", 'winked', 'charming', "clergyman's", 'contributes', 'silent', 'excess', 'sung', 'forecast', 'lists', 'secretly', 'curio', 'commoners', 'uniformed', 'jeunes', 'culmination', 'their', 'violently', 'read', 'free', 'salary', 'paid', 'hope', 'probably', 'my', 'downfall', 'touch', 'apparent', 'pattern', 'admit', 'mighty', 'saner', 'los', 'himself', 'scandal', 'babylon', 'comfortably', 'metal', 'maltreat', 'focused', 'wynn', 'tries', 'modern', 'trump', 'parent', 'low', 'presently', 'aviary', 'during', "carpenter's", 'remarks', 'threw', 'roles', 'inches', 'understand', "it's", 'comprehend', 'lifelong', 'ready', 'glad', 'relentlessness', 'acts', 'carvings', 'filling', 'gooshey', 'jumpy', 'stems', 'bothering', 'cowardice', 'pages', 'beaches', 'floor', 'serving', 'needless', 'seems', 'spot', 'presidential', "'", 'portant', 'shouted', 

In [45]:
print(len(wordlist_uniq_words))

235892
