## Stemming

* Stemming removes the suffix from a word to obtain its base (root) form, i.e., it reduces an inflected form of a word to a common stem.
* A stemmer chops off endings such as ‘s’, ‘es’, ‘ed’, ‘ing’ and ‘ly’. The result is sometimes undesirable because the stem may not be a valid word, but stemming still helps standardize text.
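The idea of suffix chopping can be sketched in a few lines of plain Python. This is only a toy illustration: the `naive_stem` function, the `SUFFIXES` tuple and the `min_stem_len` parameter are hypothetical names introduced here, and real stemmers such as Porter apply ordered rule sets with extra conditions rather than a flat suffix list.

```python
# Toy suffix stripper (illustration only, not the Porter algorithm).
SUFFIXES = ("ing", "ly", "ed", "es", "s")  # checked in this order


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        remaining = len(lowered) - len(suffix)
        if lowered.endswith(suffix) and remaining >= min_stem_len:
            return lowered[:-len(suffix)]
    return lowered  # no suffix matched, or the word is too short


examples = ["walking", "walked", "quickly", "boxes", "services", "was"]
stems = {word: naive_stem(word) for word in examples}

for word, stem in stems.items():
    print(f"{word:<10} -> {stem}")
```

Even this toy version shows both sides of stemming: "walking" and "walked" collapse to the same stem, while "services" is cut down to "servic", which is not a real word.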

## nltk

In [13]:
import nltk

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

In [14]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### Porter Stemmer

This is one of the most common stemmers and also one of the gentlest. It is very fast, but its stems are not always precise.


In [15]:
porter_stemmer = PorterStemmer()

In [16]:
def stem_words_porter(text):
  words = word_tokenize(text)

  stem_words = [porter_stemmer.stem(word) for word in words]

  return stem_words

In [29]:
input_text = "SpaceX is an American aerospace manufacturer, space transportation services and communications company headquartered in Hawthorne, California. It was established by Elon Musk"

stem_words_porter(input_text)

['spacex',
 'is',
 'an',
 'american',
 'aerospac',
 'manufactur',
 ',',
 'space',
 'transport',
 'servic',
 'and',
 'commun',
 'compani',
 'headquart',
 'in',
 'hawthorn',
 ',',
 'california',
 '.',
 'It',
 'wa',
 'establish',
 'by',
 'elon',
 'musk']

#### Snowball Stemmer

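* The Snowball Stemmer is an improved version of the Porter Stemmer, and the improvements make it more precise over large datasets.

* NLTK's Snowball Stemmer can also leave stopwords unstemmed, but only when it is created with `ignore_stopwords=True`. The default is `False`, which is what the code below uses.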

In [30]:
snowball_stemmer = SnowballStemmer(language="english")

In [31]:
def stem_words_snowball(text):
  words = word_tokenize(text)

  stem_words = [snowball_stemmer.stem(word) for word in words]

  return stem_words

In [32]:
stem_words_snowball(input_text)

['spacex',
 'is',
 'an',
 'american',
 'aerospac',
 'manufactur',
 ',',
 'space',
 'transport',
 'servic',
 'and',
 'communic',
 'compani',
 'headquart',
 'in',
 'hawthorn',
 ',',
 'california',
 '.',
 'it',
 'was',
 'establish',
 'by',
 'elon',
 'musk']

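You can see that the Snowball Stemmer keeps "was" intact, whereas the Porter Stemmer reduces it to "wa". The two also differ on "communications": Snowball gives "communic" and Porter gives "commun".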
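To make the differences easier to scan, the sketch below runs both stemmers over a few words from the input text and prints the stems side by side. The word selection and the `comparison` dictionary are illustrative choices that are not part of the original notebook; the stemmer calls are the same NLTK ones used above.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# A few words taken from the input text above (illustrative selection).
sample_words = ["was", "communications", "services", "company"]

# Map each word to its (Porter stem, Snowball stem) pair.
comparison = {w: (porter.stem(w), snowball.stem(w)) for w in sample_words}

for word, (p_stem, s_stem) in comparison.items():
    marker = "" if p_stem == s_stem else "  <- differs"
    print(f"{word:<15} porter={p_stem:<10} snowball={s_stem}{marker}")
```

Working on single words like this needs no tokenizer, so the `punkt` download is not required for this snippet.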

#### Lancaster Stemmer
* This is a very aggressive stemmer that trims the vocabulary drastically.
* It is fast, but it is generally not advisable because the resulting stems are often inaccurate.

In [24]:
lancaster = LancasterStemmer()

In [27]:
def stem_words_lancaster(text):
  words = word_tokenize(text)

  stem_words = [lancaster.stem(word) for word in words]

  return stem_words

In [28]:
stem_words_lancaster(input_text)

['spacex',
 'is',
 'an',
 'am',
 'aerospac',
 'manufact',
 ',',
 'spac',
 'transport',
 'serv',
 'and',
 'commun',
 'company',
 'headquart',
 'in',
 'hawthorn',
 ',',
 'californ']
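One rough way to see how strongly each stemmer trims the vocabulary is to count the unique stems it produces for the same word list. The sketch below does that for a small hand-picked list. The word list and the `vocab_sizes` name are assumptions made for illustration, and counts on such a tiny sample only hint at behaviour on a real corpus.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Hand-picked word list (illustrative; not from the original notebook).
words = ["service", "services", "serving", "served", "server",
         "space", "spaces", "spacing", "company", "companies"]

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Vocabulary size before stemming, then after each stemmer.
vocab_sizes = {"original": len(set(words))}
for name, stemmer in stemmers.items():
    stems = {stemmer.stem(w) for w in words}
    vocab_sizes[name] = len(stems)
    print(f"{name:<10} {len(stems):>2} unique stems: {sorted(stems)}")

print(vocab_sizes)
```

A smaller count means more word forms were merged into the same stem, which is the vocabulary reduction described above, at the cost of less accurate base words.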

## spaCy

It might be surprising, but spaCy does not provide any function for stemming; it relies on lemmatization only.


## Problems with Stemming

***Ex:*** The root word of **services** is given as **servic**, which is not a valid word, as the Porter and Snowball outputs above show.
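The problem can be checked directly by comparing each stem with the base form a reader would expect. In the sketch below, the `expected_base` mapping is a hand-written assumption for three words from the input text, not something NLTK provides, and `mismatches` is a hypothetical helper name.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Hand-written expected base forms (assumption made for illustration).
expected_base = {
    "services": "service",
    "company": "company",
    "aerospace": "aerospace",
}

# Collect the words whose stem is not the expected base form.
mismatches = {}
for word, base in expected_base.items():
    stem = porter.stem(word)
    if stem != base:
        mismatches[word] = stem
    print(f"{word:<10} stem={stem:<10} expected={base}")

print("Stems that are not valid base words:", mismatches)
```

Stems like these are still useful as keys for grouping related word forms, but they should not be displayed to users or treated as dictionary words.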