# Stemming
Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for **jump** might also return **jumps** and **jumping**. **Jump** is the **stem** for [jumps, jumped, jumping].

Stemming essentially chops off letters from the end of a word until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. 
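
To make the idea concrete, here is a minimal sketch of suffix chopping in plain Python. The `SUFFIXES` tuple and the `naive_stem` function are hypothetical names invented for this illustration, and the rule set is deliberately tiny; it is not how any real stemmer is implemented. It handles the regular forms of *jump*, but it shows where simple chopping breaks down on English exceptions.

```python
# Naive suffix stripping: a toy illustration, not a real stemming algorithm.
# SUFFIXES is a hypothetical, deliberately tiny rule set.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Regular forms collapse to "jump"; "running", "ran" and "geese" do not
# collapse to "run" or "goose", which is where simple chopping falls short.
for word in ["jumps", "jumped", "jumping", "running", "ran", "geese"]:
    print(word, "->", naive_stem(word))
```

The doubled consonant in *running* and the irregular forms *ran* and *geese* are the kind of exceptions that need a more sophisticated process.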

`spaCy` doesn't include a stemmer; it relies entirely on lemmatization instead.


So we'll use another popular NLP tool called **nltk**, which stands for *Natural Language Toolkit*. For more information on nltk visit https://www.nltk.org/

## Porter Stemmer

One of the most common and effective stemming tools is [*Porter's Stemming Algorithm*](https://tartarus.org/martin/PorterStemmer/), developed by Martin Porter in [1980](https://tartarus.org/martin/PorterStemmer/def.txt).

The algorithm employs five phases of word reduction, each with its own set of mapping rules. In the first phase, simple suffix mapping rules are defined. 
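
As a rough sketch of what these suffix mapping rules look like, the snippet below encodes the first sub-step (step 1a) of the first phase as it appears in Porter's 1980 definition. The names `STEP_1A_RULES` and `porter_step_1a` are made up for this illustration. It covers only this one sub-step, and nltk's own implementation may differ in its details.

```python
# Step 1a suffix rules from Porter's 1980 definition, listed so that the
# longest matching suffix is tried first. Names here are illustrative only.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (unchanged)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word: str) -> str:
    """Apply the first matching step 1a rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "caress", "cats", "pony"]:
    print(word, "->", porter_step_1a(word))
```

Note that this sub-step leaves *pony* untouched; the full algorithm only turns it into *poni* in a later sub-step.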

Read the notes on Blackboard for more information on stemming rules.

In [16]:
# Import the NLTK toolkit and the Porter Stemmer library
import nltk

from nltk.stem.porter import PorterStemmer

In [17]:
porter_stemmer = PorterStemmer()

In [18]:
# Pass the sample words as individual strings
sample_words = ["caresses", "ponies", "pony", "cats", "running", "runner", "climber", "easily", "quickly"]

In [19]:
for word in sample_words:
    print(word + " ------> " + porter_stemmer.stem(word))

caresses ------> caress
ponies ------> poni
pony ------> poni
cats ------> cat
running ------> run
runner ------> runner
climber ------> climber
easily ------> easili
quickly ------> quickli



## Porter2 Stemmer
The Porter2 stemmer was also developed by Martin Porter, who wrote it in Snowball, his language for defining stemming algorithms. It offers a slight improvement over the original Porter stemmer shown above, in both logic and speed. **nltk** exposes it through the `SnowballStemmer` class, so that is the name used in this example.

In [21]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer instance requires you to pass a language parameter
snowball_stemmer = SnowballStemmer(language="english")

Now I'll pass in the same list of words as shown above.

In [24]:
sample_words = ["caresses", "ponies", "pony", "cats", "running", "runner", "climber", "easily", "quickly"]
for word in sample_words:
    print(word + " ----> " + snowball_stemmer.stem(word))

caresses ----> caress
ponies ----> poni
pony ----> poni
cats ----> cat
running ----> run
runner ----> runner
climber ----> climber
easily ----> easili
quickly ----> quick


Here the Porter2 (Snowball) stemmer produced the same output as the original Porter stemmer, with one exception: it was able to determine that *quickly* should be stemmed to *quick*.
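
When the word list grows, comparing two outputs by eye gets tedious. The helper below is a small sketch that reports only the words on which the stemmers disagree. The function name `differing_stems` is hypothetical, and the two dictionaries simply replay a few of the outputs printed above so that the snippet runs without nltk.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def differing_stems(
    words: Iterable[str],
    stemmers: Dict[str, Callable[[str], str]],
) -> List[Tuple[str, Dict[str, str]]]:
    """Return (word, {stemmer name: stem}) for words the stemmers disagree on."""
    rows = []
    for word in words:
        stems = {name: stem(word) for name, stem in stemmers.items()}
        if len(set(stems.values())) > 1:
            rows.append((word, stems))
    return rows


# Stand-ins that replay some of the outputs printed earlier in this notebook.
porter_output = {"running": "run", "easily": "easili", "quickly": "quickli"}
snowball_output = {"running": "run", "easily": "easili", "quickly": "quick"}

print(differing_stems(
    list(porter_output),
    {"porter": porter_output.get, "snowball": snowball_output.get},
))
```

With the objects defined earlier, the same function can be called as `differing_stems(sample_words, {"porter": porter_stemmer.stem, "snowball": snowball_stemmer.stem})`.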

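Lemmatization is probably a better method to use than stemming, since it returns a dictionary form of the word rather than a truncated stem such as *poni* or *easili*.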