# Speech and Language Processing
### An introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition
Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st. ed.). Prentice Hall PTR, USA.

### Stemming

Stemming is a technique in Natural Language Processing (NLP) that reduces words to a base form, known as the stem, so that inflected variants of a word are treated as the same term. The goal is to shrink the number of unique words a system has to process and to let variants such as "runs" and "running" match each other. Stemming is particularly useful in applications such as search engines or sentiment analysis, where the focus is on the underlying meaning of the words rather than their exact form.

Stemming works by applying heuristic rules that strip suffixes from words to obtain a common stem. For example, the words "running", "runs", and "run" all reduce to the stem "run". Several algorithms are available, including the Porter stemmer, the Snowball (Porter2) stemmer, and the Lancaster stemmer, among others.
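
To make the suffix-stripping idea concrete, here is a minimal sketch in plain Python. It is a toy, not the Porter, Snowball, or Lancaster algorithm: the rule list `SUFFIXES`, the function name `toy_stem`, and the `min_stem_len` parameter are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer (illustrative only; NOT the Porter algorithm).
# SUFFIXES is a hypothetical rule list, checked in order, longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # No rule applied: the word is its own stem.
    return word


for w in ["running", "runs", "run"]:
    print(w, "->", toy_stem(w))
```

All three inputs come out as "run". A handful of rules like these will mishandle many words (for instance, the undoubling step also shortens "passes" to "pas"), which is why real stemmers use much larger, carefully ordered rule sets.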

One of the challenges of stemming is that the stem is not always a real word, meaning it may not be found in the dictionary. For example, the Porter stemmer reduces "happy" to "happi" and "change" to "chang", neither of which is a word. Stemmers can also miss relationships that are not expressed through a regular suffix: the Porter stemmer leaves "better" unchanged rather than relating it to "good". Both effects can lead to errors in text processing and analysis. To mitigate this, lemmatization can be used instead of stemming; it reduces each word to its dictionary form (the lemma) rather than to a truncated stem.
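
The difference between a stem and a dictionary form can be sketched as follows. The Porter stems are copied from the NLTK output shown later in this section; the lemmatizer is only a stand-in, with `LEMMA_LEXICON` and `toy_lemmatize` being hypothetical names and a hand-written lookup table taking the place of a real dictionary such as WordNet.

```python
# Porter stems copied from the NLTK output shown later in this section.
PORTER_STEMS = {
    "change": "chang",
    "changing": "chang",
    "happy": "happi",
    "happier": "happier",
}

# Hypothetical hand-written lexicon mapping inflected forms to lemmas.
# A real lemmatizer would consult a full dictionary instead.
LEMMA_LEXICON = {
    "changing": "change",
    "happier": "happy",
}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if the lexicon knows the word, else the word itself."""
    word = word.lower()
    return LEMMA_LEXICON.get(word, word)


for word, stem in PORTER_STEMS.items():
    print(f"{word:10} stem={stem:10} lemma={toy_lemmatize(word)}")
```

Every lemma printed here is a real word, while stems such as "chang" and "happi" are not. The price of lemmatization is that it needs a lexicon (and usually part-of-speech information), whereas a stemmer only needs its rules.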

Despite these challenges, stemming remains a valuable and inexpensive way to reduce the number of unique words that need to be processed. It is commonly used in search engines, text classification, and sentiment analysis, among other applications. The choice of algorithm matters, because stemmers differ in how aggressively they strip suffixes: in the NLTK example below, the Porter stemmer leaves "happier" unchanged while the Lancaster stemmer reduces it to "happy". It is therefore important to evaluate the specific algorithm, and the limitations described above, against the task at hand.
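
As a rough illustration of the search-engine use case, the sketch below builds a small inverted index keyed by stems. Everything in it is an assumption for demonstration purposes: `simple_stem` is a hypothetical stand-in for a real stemmer, and `build_index` and the three sample documents are invented.

```python
from collections import defaultdict


def simple_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, stem=simple_stem):
    """Map each stem to the set of document ids that contain a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index


# Invented sample documents.
docs = {
    1: "the plan changes often",
    2: "changing plans is costly",
    3: "plans changed again",
}

index = build_index(docs)
raw_vocab = {token for text in docs.values() for token in text.lower().split()}

# Fewer index keys than surface forms: variants share one stem.
print(len(raw_vocab), "surface forms ->", len(index), "stems")

# A query is stemmed the same way, so "changing" also finds "changes" and "changed".
print(sorted(index[simple_stem("changing")]))
```

Passing the stemmer in as the `stem` argument keeps the index code independent of the algorithm, so a library stemmer could be swapped in without touching `build_index`.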

<img src="https://d2mk45aasx86xg.cloudfront.net/difference_between_Stemming_and_lemmatization_8_11zon_452539721d.webp">

```python
from nltk.stem import PorterStemmer, LancasterStemmer

words = ['change', 'changing', 'happy', 'happier']

porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()

# Pair each word's Porter stem with its Lancaster stem.
stemmed_full = [(porter_stemmer.stem(word), lancaster_stemmer.stem(word)) for word in words]

print("Porter Stemmer, Lancaster Stemmer")
for stem_word in stemmed_full:
    print(stem_word)
```

Output:

```
Porter Stemmer, Lancaster Stemmer
('chang', 'chang')
('chang', 'chang')
('happi', 'happy')
('happier', 'happy')
```
