## Stemming

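Stemming is the process of reducing a word to its **word stem**, the base form to which affixes (prefixes and suffixes) attach. The stem is produced by heuristic rules and need not be a valid dictionary word, which is what distinguishes it from a lemma. Stemming is a common preprocessing step in natural language processing (NLP) and natural language understanding (NLU).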

**Let us say we have a classification problem**
Find out whether the comment on a product is positive or negative.
Reviews (text) contain forms such as eating, eaten, eat -> the shared root word is eat.


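Stemming helps to reduce the dimensionality of the data.
Each distinct word typically gets its own dimension in a high-dimensional feature space (for example in a bag-of-words or one-hot encoding), so mapping several inflected forms to one stem shrinks the vocabulary.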
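
To make the dimensionality argument concrete, here is a minimal sketch that counts vocabulary size before and after stemming. It uses only the standard library; `toy_stem` is a hypothetical, deliberately naive suffix stripper standing in for a real stemmer, and the three reviews are made-up sample data.

```python
import re

def toy_stem(word):
    """Hypothetical naive stemmer: strip a trailing 'ing' or 's' from longer words."""
    if len(word) <= 3:
        return word  # leave very short words untouched
    return re.sub(r"(ing|s)$", "", word)

# Made-up sample reviews, already lowercased.
reviews = ["eating here was great", "he eats here daily", "we eat here"]
tokens = [tok for review in reviews for tok in review.split()]

raw_vocab = sorted(set(tokens))
stemmed_vocab = sorted(set(toy_stem(tok) for tok in tokens))

print(len(raw_vocab), raw_vocab)
print(len(stemmed_vocab), stemmed_vocab)
```

Here "eating", "eats", and "eat" collapse into the single feature "eat", so a bag-of-words vector built on the stemmed vocabulary has fewer columns.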

#### Stemming Algorithms
1. *Porter Stemmer*
2. *Snowball Stemmer*
3. *Lancaster Stemmer*

#### Advantages of Stemming
* Reduces the dimensionality of the data.
* Helps to find the **stem word**.
* Helps to group the words into the same category.

#### Disadvantages of Stemming
* It can strip away the context or meaning of some words.
* It is not always accurate.
* It is not suitable for all types of data.


**Many of the disadvantages of stemming are addressed by lemmatization.**


### Porter Stemmer
The Porter Stemmer is a rule-based stemming algorithm that strips common English suffixes to reduce words to their stems.

In [2]:
words = ['eating', 'eaten', 'eater', 'eats', 'ate', 'eat', 
         'write', 'written', 'writer', 'writes', 'wrote', 
         'programmer', 'programming', 'program', 'programmed', 
         "going", "goes", "go", "gone", 
         'history', 'historical', 'historian', 'historically',
         'finally', 'finalised', 'final']

from nltk.stem import PorterStemmer
stemming = PorterStemmer()

for word in words:
    print(word, "  --->  ", stemming.stem(word))


eating   --->   eat
eaten   --->   eaten
eater   --->   eater
eats   --->   eat
ate   --->   ate
eat   --->   eat
write   --->   write
written   --->   written
writer   --->   writer
writes   --->   write
wrote   --->   wrote
programmer   --->   programm
programming   --->   program
program   --->   program
programmed   --->   program
going   --->   go
goes   --->   goe
go   --->   go
gone   --->   gone
history   --->   histori
historical   --->   histor
historian   --->   historian
historically   --->   histor
finally   --->   final
finalised   --->   finalis
final   --->   final


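* Notice how regular inflections are reduced to a common stem (eating, eats -> eat), while irregular forms such as ate, eaten, and wrote are left unchanged.
* Some stems are not valid words, for example history -> histori and goes -> goe. The output is a normalized token, not a dictionary form, so some of the meaning can be lost.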
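
Stems such as `goe` and `histori` in the output above follow from the kind of suffix rules the algorithm applies. The sketch below is a simplified illustration of two rules from the published Porter algorithm (plural stripping and the y -> i rule). It is not NLTK's implementation: the function names are hypothetical, and the vowel test ignores the special handling of "y".

```python
def step_1a(word):
    """Simplified plural rules, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

def step_1c(word):
    """Turn a final 'y' into 'i' when the rest of the word contains a vowel."""
    stem = word[:-1]
    if word.endswith("y") and any(ch in "aeiou" for ch in stem):
        return stem + "i"
    return word

for w in ["caresses", "ponies", "cats", "goes", "history", "sky"]:
    print(w, "  --->  ", step_1c(step_1a(w)))
```

Because the rules only look at letter patterns, "goes" loses its final "s" like any other plural, which leaves the non-word "goe".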

In [3]:
stemming.stem('congratulations') # meaning is lost

'congratul'

In [4]:
stemming.stem('sitting') # meaning is preserved

'sit'

### RegexpStemmer Class

`RegexpStemmer` removes affixes from a word with the help of a regular expression, which makes it easy to implement simple custom stemming rules. It takes a regular expression and deletes every substring of the word that matches it, so the pattern should be anchored (`^` for prefixes, `$` for suffixes) to avoid removing text from the middle of a word.

In [10]:
from nltk.stem import RegexpStemmer
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)
# A trailing 'ing', 's', 'e', or 'able' is removed; '$' anchors each pattern to the end of the word.
# min=4 means that words shorter than 4 characters are returned unchanged.

print(reg_stemmer.stem('eating'))   

print(reg_stemmer.stem('ingeating'))


for word in words:
    print(word, "  --->  ", reg_stemmer.stem(word))


eat
ingeat
eating   --->   eat
eaten   --->   eaten
eater   --->   eater
eats   --->   eat
ate   --->   ate
eat   --->   eat
write   --->   writ
written   --->   written
writer   --->   writer
writes   --->   write
wrote   --->   wrot
programmer   --->   programmer
programming   --->   programm
program   --->   program
programmed   --->   programmed
going   --->   go
goes   --->   goe
go   --->   go
gone   --->   gon
history   --->   history
historical   --->   historical
historian   --->   historian
historically   --->   historically
finally   --->   finally
finalised   --->   finalised
final   --->   final
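
The behaviour seen in this output can be approximated with the standard `re` module. The function below is a hypothetical stand-in, written on the assumption that the stemmer returns short words unchanged and otherwise deletes every match of the pattern. It also shows why the `$` anchor matters.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Hypothetical stand-in: delete every regex match unless the word is too short."""
    if len(word) < min_length:
        return word  # shorter than the minimum length: leave untouched
    return re.sub(pattern, "", word)

anchored = r"ing$|s$|e$|able$"   # suffix-only pattern from the cell above
unanchored = r"ing"              # same idea without the end-of-word anchor

print(regexp_stem("eating", anchored, min_length=4))
print(regexp_stem("ingeating", anchored, min_length=4))
print(regexp_stem("ingeating", unanchored, min_length=4))
print(regexp_stem("ate", anchored, min_length=4))
```

With the anchored pattern only the trailing "ing" of "ingeating" is removed, whereas the unanchored pattern also deletes the leading one.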


### Snowball Stemmer
Snowball Stemmer is a stemming algorithm that is used to reduce the words to their root word. It is a more advanced version of the Porter Stemmer (its English stemmer is also known as Porter2) and it supports several languages, which is why a language name is passed to the constructor.


In [11]:
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')

for word in words:
    print(word, "  --->  ", snowball_stemmer.stem(word))

eating   --->   eat
eaten   --->   eaten
eater   --->   eater
eats   --->   eat
ate   --->   ate
eat   --->   eat
write   --->   write
written   --->   written
writer   --->   writer
writes   --->   write
wrote   --->   wrote
programmer   --->   programm
programming   --->   program
program   --->   program
programmed   --->   program
going   --->   go
goes   --->   goe
go   --->   go
gone   --->   gone
history   --->   histori
historical   --->   histor
historian   --->   historian
historically   --->   histor
finally   --->   final
finalised   --->   finalis
final   --->   final


In [12]:
# Compare the output of Snowball Stemmer with the Porter Stemmer.
stemming.stem('fairly'), stemming.stem('sportingly') # meaning is lost

('fairli', 'sportingli')

In [13]:
snowball_stemmer.stem('fairly'), snowball_stemmer.stem('sportingly') # meaning is preserved

('fair', 'sport')

In [14]:
snowball_stemmer.stem('goes')

'goe'
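
To compare stemmers over a whole word list rather than one word at a time, a small helper can collect only the words on which they disagree. This is a sketch: `disagreements` is a hypothetical helper, and the two lambdas are stand-in callables so that the block runs without NLTK. In the notebook one could pass `stemming.stem` and `snowball_stemmer.stem` instead.

```python
def disagreements(words, stemmers):
    """Return {word: {name: stem}} for every word on which the stemmers differ."""
    result = {}
    for word in words:
        stems = {name: stem(word) for name, stem in stemmers.items()}
        if len(set(stems.values())) > 1:
            result[word] = stems
    return result

# Stand-in callables (assumptions for illustration only).
demo_stemmers = {
    "strip_s": lambda w: w[:-1] if w.endswith("s") else w,
    "strip_ly": lambda w: w[:-2] if w.endswith("ly") else w,
}

diff = disagreements(["eats", "fairly", "eat"], demo_stemmers)
for word, stems in diff.items():
    print(word, stems)
```

Words on which all stemmers agree, such as "eat" here, are omitted, which keeps the comparison focused on the cases worth inspecting.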