# 2.6 Stemming

The next step in preprocessing is to standardize the text. One option is stemming, which reduces words to a base form (the stem). For example, ‘connecting’ and ‘connected’ are both stemmed to ‘connect’. Stemming works by stripping the suffix from a word, so the resulting stem is sometimes not a meaningful or proper word.
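To illustrate the suffix-stripping idea, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm, which applies a much longer, ordered set of rules. The names `SUFFIXES`, `naive_stem` and `min_stem_len` are hypothetical and exist only for this illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of the idea.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule list


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule matched, so the word is returned unchanged


for w in ["connecting", "connected", "connects", "making"]:
    print(w, " : ", naive_stem(w))
```

The last word shows the drawback mentioned above: blindly stripping ‘ing’ from ‘making’ leaves ‘mak’, which is not a proper word.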

We standardize the text in this way because it lowers the number of unique words in our dataset, which reduces the size and complexity of our data. Removing complexity and noise is an important step in preparing data for machine learning.

In [2]:
from nltk.stem import PorterStemmer

In [3]:
# create stemmer
ps = PorterStemmer()

In [4]:
connect_tokens = ['connecting', 'connected', 'connectivity', 'connect', 'connects']

In [5]:
for t in connect_tokens:
    print(t, " : ", ps.stem(t))

connecting  :  connect
connected  :  connect
connectivity  :  connect
connect  :  connect
connects  :  connect


In [6]:
learn_tokens = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']

In [7]:
# observe which forms reduce to 'learn' and which do not ('learner', 'learners')
for t in learn_tokens:
    print(t, " : ", ps.stem(t))

learned  :  learn
learning  :  learn
learn  :  learn
learns  :  learn
learner  :  learner
learners  :  learner


In [8]:
likes_tokens = ['likes', 'better', 'worse']

In [9]:
for t in likes_tokens:
    print(t, " : ", ps.stem(t))

likes  :  like
better  :  better
worse  :  wors


## Another example

In [10]:
run_tokens = ['running', 'runner', 'ran', 'runs']
for t in run_tokens:
    print(t, " : ", ps.stem(t))

running  :  run
runner  :  runner
ran  :  ran
runs  :  run


## What I Learned

- PorterStemmer **reduces words to a stem**, which is not always a linguistically correct root

- It's rule-based, so it may return stems that are not real words (worse → wors)

- It does not handle irregular forms well (ran → ran, better → better)

- Good for feature reduction (a smaller vocabulary) in classic NLP
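
To make the feature-reduction point concrete, the sketch below counts unique tokens before and after stemming. So that it runs without NLTK, it does not call the stemmer; it uses a lookup table (`observed_stems`, a name made up for this example) filled with the PorterStemmer outputs printed in the cells above.

```python
# Stems copied from the PorterStemmer outputs printed earlier in this section.
observed_stems = {
    "connecting": "connect", "connected": "connect", "connectivity": "connect",
    "connect": "connect", "connects": "connect",
    "learned": "learn", "learning": "learn", "learn": "learn", "learns": "learn",
    "learner": "learner", "learners": "learner",
}

tokens = list(observed_stems)

# Vocabulary = set of distinct strings, before and after stemming.
raw_vocab = set(tokens)
stemmed_vocab = {observed_stems[t] for t in tokens}

print(len(raw_vocab), "unique tokens before stemming")
print(len(stemmed_vocab), "unique stems after stemming")
```

For these eleven tokens the vocabulary shrinks to three stems (‘connect’, ‘learn’, ‘learner’). In a real pipeline the lookup would be replaced by a call to `ps.stem(t)`.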