## Stemming
Stemming is the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes. The stem is not necessarily the word's dictionary root (its lemma); for example, a stemmer may reduce "history" to "histori". Stemming is an important preprocessing step in natural language understanding (NLU) and natural language processing (NLP).
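
To make the idea of affix stripping concrete, here is a deliberately naive sketch. The helper `naive_stem`, its suffix list, and the `min_stem_len` parameter are hypothetical names chosen for illustration; this is not the Porter algorithm or anything NLTK ships.

```python
# Hypothetical helper for illustration only; real stemmers apply ordered rule sets.
SUFFIXES = ("ing", "en", "s")

def naive_stem(word, suffixes=SUFFIXES, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["eating", "eats", "eaten", "going", "goes"]:
    print(w, "---->", naive_stem(w))
```

The sketch handles "eating", "eats" and "eaten", but leaves "going" untouched and turns "goes" into "goe", which is why practical stemmers rely on more carefully designed rules.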

In [19]:
## Classification problem
## Decide whether a product review comment is positive or negative
## Goal: map the inflected forms found in reviews to a common base form
## [eating, eat, eaten] --> eat
## [going, gone, goes] --> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In [4]:
words

['eating',
 'eats',
 'eaten',
 'writing',
 'writes',
 'programming',
 'programs',
 'history',
 'finally',
 'finalized']

### PorterStemmer

In [5]:
from nltk.stem import PorterStemmer

In [6]:
stemming = PorterStemmer()

for word in words:
    print(word + " ----> "+stemming.stem(word=word))

eating ----> eat
eats ----> eat
eaten ----> eaten
writing ----> write
writes ----> write
programming ----> program
programs ----> program
history ----> histori
finally ----> final
finalized ----> final


In [7]:
stemming.stem('congratulations')

'congratul'

### RegexpStemmer class
NLTK's `RegexpStemmer` class implements a stemmer driven by a single regular expression that identifies morphological affixes. Any substring of the word that matches the expression is removed, so the pattern must be anchored with `$` (or `^`) to restrict it to suffixes (or prefixes).

In [49]:
from nltk.stem import RegexpStemmer

##### Parameters
regexp : str or regexp
The regular expression that should be used to identify morphological affixes.

min : int
The minimum length of string to stem. Shorter words are returned unchanged.

In [50]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [51]:
reg_stemmer.stem('eating')

'eat'

In [52]:
reg_stemmer.stem('ingeating')

'ingeat'

In [53]:
reg_stemmer = RegexpStemmer('ing|s$|e$|able$', min=4)

In [54]:
reg_stemmer.stem('eating')

'eat'

In [55]:
reg_stemmer.stem('ingeating')

'eat'
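
The difference between the anchored pattern (`ing$`) and the unanchored one (`ing`) can be reproduced with the standard `re` module. The function `regexp_stem` below is a hypothetical stand-in written for this comparison, assuming the stemmer simply deletes every match from words of at least `min` characters; that assumption is consistent with the outputs above but is not taken from the NLTK source.

```python
import re

def regexp_stem(word, pattern, min_len=0):
    """Leave short words alone; otherwise delete every substring matching `pattern`."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

anchored = r"ing$|s$|e$|able$"    # "ing" only matches at the end of the word
unanchored = r"ing|s$|e$|able$"   # "ing" matches anywhere in the word

print(regexp_stem("ingeating", anchored, min_len=4))    # ingeat
print(regexp_stem("ingeating", unanchored, min_len=4))  # eat
print(regexp_stem("was", anchored, min_len=4))          # was (shorter than min_len)
```

Without the `$` anchor, the leading "ing" in "ingeating" is removed as well, so suffix patterns should normally be anchored.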

### Snowball Stemmer
The Snowball stemmer for English is also known as the Porter2 stemming algorithm. It is a revised version of the Porter stemmer that fixes several of its issues.

The following languages are supported: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.

Creating a stemmer through `SnowballStemmer(language)` is useful if you do not know the language to be stemmed until runtime. Alternatively, if you already know the language, you can instantiate the language-specific stemmer class (for example, `EnglishStemmer` from `nltk.stem.snowball`) directly.
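
As a minimal sketch of the two ways to obtain a stemmer, the snippet below assumes NLTK is installed. The variable names `language`, `runtime_stemmer` and `english_stemmer` are illustrative only.

```python
from nltk.stem import SnowballStemmer
from nltk.stem.snowball import EnglishStemmer

# Language only known at runtime (for example, read from a config file).
language = "english"
runtime_stemmer = SnowballStemmer(language)

# Language known in advance: use the language-specific class directly.
english_stemmer = EnglishStemmer()

# The class attribute `languages` lists the names accepted by SnowballStemmer.
print(SnowballStemmer.languages)
print(runtime_stemmer.stem("running"), english_stemmer.stem("running"))
```

Both objects should produce the same stems for English text; the generic constructor just selects the language-specific implementation by name.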



##### Parameters
language : str or unicode
The language whose subclass is instantiated.

ignore_stopwords : bool
If set to True, stopwords are not stemmed and returned unchanged. Set to False by default.

##### Raises
ValueError
If there is no stemmer for the specified language, a `ValueError` is raised.
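
When the language name comes from user input, the `ValueError` can be caught explicitly. The sketch below assumes NLTK is installed; `make_stemmer` is a hypothetical helper, and "klingon" is just an example of an unsupported name.

```python
from nltk.stem import SnowballStemmer

def make_stemmer(language):
    """Return a SnowballStemmer, or None if the language is unsupported."""
    try:
        return SnowballStemmer(language)
    except ValueError:
        return None

print(make_stemmer("english") is not None)
print(make_stemmer("klingon"))
```

Returning `None` is only one possible design; depending on the application, it may be preferable to let the exception propagate or to fall back to a default language.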

In [16]:
from nltk.stem import SnowballStemmer

In [22]:
snowballstemmer = SnowballStemmer(language='english')

In [24]:
for word in words:
    print(word+" ---> "+snowballstemmer.stem(word))

eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> write
writes ---> write
programming ---> program
programs ---> program
history ---> histori
finally ---> final
finalized ---> final


In [25]:
# Porter stemmer (the `stemming` object created earlier)
stemming.stem('fairly'), stemming.stem('sportingly')

('fairli', 'sportingli')

In [27]:
# Snowball stemmer on the same words, for comparison
snowballstemmer.stem('fairly'), snowballstemmer.stem('sportingly')

('fair', 'sport')