**-----------------------------------------------------------------------------------------------------------------**

*In this lecture we are going to explore:*

1. What is the role of stemming in NLP?
2. What are the basics of stemming?
3. What are common regex functions used in NLP?
4. What errors can occur in stemming?

**-----------------------------------------------------------------------------------------------------------------**

# 2.6 Stemming

The next step in preprocessing is to standardize the text. One option is stemming, which reduces words to a base form (the stem). For example, ‘connecting’ and ‘connected’ are both stemmed to the base form ‘connect’. Stemming works by stripping the suffix from a word, so the resulting stem is sometimes not a meaningful or proper word.
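To make the suffix-removal idea concrete, here is a minimal sketch of a toy suffix stripper. It is an illustration only and is not the Porter algorithm. The names `SUFFIXES`, `naive_stem` and `min_stem_len` are hypothetical and chosen just for this example.

```python
# Toy suffix stripper: an illustration only, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list, checked in this order


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule matched, so the word is returned unchanged
    return word


for w in ["connecting", "connected", "connects", "connect", "making"]:
    print(w, " : ", naive_stem(w))
```

With these rules the ‘connect’ variants all collapse to ‘connect’, while ‘making’ becomes ‘mak’, which is not a proper word. Real stemmers apply many more rules and conditions, but the same trade-off remains.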

We standardize the text in this way because it lowers the number of unique words in our dataset, which reduces the size and complexity of our data. Removing complexity and noise is an important step in preparing our data for machine learning.
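The effect on vocabulary size can be sketched with a small, made-up token list. The list `tokens` below is a hypothetical sample that reuses words from this lecture, and the sketch assumes NLTK is installed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample of tokens, reusing words from this lecture
tokens = ["connecting", "connected", "connects", "connect",
          "learned", "learning", "learns", "learn"]

# Vocabulary = set of unique tokens, before and after stemming
vocab_before = set(tokens)
vocab_after = {stemmer.stem(t) for t in tokens}

print("unique tokens before stemming:", len(vocab_before))
print("unique tokens after stemming: ", len(vocab_after))
print(sorted(vocab_after))
```

Here eight distinct tokens collapse into two stems, so any feature set built on word counts becomes correspondingly smaller.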

![2.6_1_Stemming_53678d43bc.png](attachment:a374d225-5e6f-4648-9fde-2c612de1d467.png)

![2.6_2_stemming.png](attachment:1de04aaf-e15e-4e1d-97bd-cbe2aec276c5.png)

In [1]:
from nltk.stem import PorterStemmer

In [2]:
# create stemmer
ps = PorterStemmer()

In [3]:
connect_tokens = ['connecting', 'connected', 'connectivity', 'connect', 'connects']

In [4]:
for t in connect_tokens:
    print(t, " : ", ps.stem(t))

connecting  :  connect
connected  :  connect
connectivity  :  connect
connect  :  connect
connects  :  connect


In [5]:
learn_tokens = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']

In [6]:
for t in learn_tokens:
    print(t, " : ", ps.stem(t))

learned  :  learn
learning  :  learn
learn  :  learn
learns  :  learn
learner  :  learner
learners  :  learner


In [7]:
likes_tokens = ['likes', 'better', 'worse']

In [8]:
for t in likes_tokens:
    print(t, " : ", ps.stem(t))

likes  :  like
better  :  better
worse  :  wors


**What are the Errors that Could Occur in Stemming?**

**Under-Stemming**

Under-stemming occurs when words that should be reduced to the same root form are not.

This can introduce unnecessary complexity into the feature set and hinder the effectiveness of text analytics algorithms.

Let’s take the example of ‘alumnus’. Its other forms are:

Alumni

Alumna

Alumnae

Although ‘alumnus’ is an English word, it follows Latin morphology, so a suffix-stripping stemmer does not reduce these related forms to a single common stem.
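One way to check this is to stem each form and count the distinct stems. This is a small sketch that assumes NLTK's `PorterStemmer`; the exact stems may differ with another stemmer or version, and the variable names are chosen for this example only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Lower-cased forms of the same word
alumnus_forms = ["alumnus", "alumni", "alumna", "alumnae"]

# Map each form to its stem
alumnus_stems = {w: stemmer.stem(w) for w in alumnus_forms}

for w, s in alumnus_stems.items():
    print(w, " : ", s)

# More than one distinct stem means the forms were not merged
print("distinct stems:", len(set(alumnus_stems.values())))
```

If the count is greater than one, the forms have been under-stemmed and will be treated as separate features.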

**Over-Stemming**

Over-stemming happens when words that should be stemmed into different root forms are incorrectly stemmed to the same root.

This can lead to a significant loss of meaning and impact the accuracy of text analytics.

For example, in search queries, the following terms are all reduced to ‘univers’:

University

Universal

Universe

In this case, even though these three words are etymologically related, their meanings are completely different.
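The same check can be run for the over-stemming case. Again, this is a sketch that assumes NLTK's `PorterStemmer`, with variable names chosen for this example only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Lower-cased terms with clearly different meanings
univers_terms = ["university", "universal", "universe"]

# Map each term to its stem
univers_stems = {w: stemmer.stem(w) for w in univers_terms}

for w, s in univers_stems.items():
    print(w, " : ", s)

# A single distinct stem means the terms were merged
print("distinct stems:", len(set(univers_stems.values())))
```

A single distinct stem for words with different meanings is the signature of over-stemming: a search for one term would also match the others.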

