## Stemming and Lemmatization

Stemming and lemmatization are text-processing techniques that reduce inflected words (variants of a base word) to a common base form, but they do so in different ways.

### Stemming:

Stemming is a rule-based approach that strips suffixes from a word to obtain a stem. The stem is not necessarily a real word, but it groups related inflected forms under one key. For example, the Porter stemmer reduces "running" and "runs" to "run", while the irregular form "ran" is left unchanged because no suffix rule applies to it.
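To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in plain Python. The rule list, the `toy_stem` function, and the `min_stem_len` parameter are hypothetical names invented for illustration; this is a simplification and not the Porter algorithm, which applies many more rules and conditions.

```python
# Toy (suffix, replacement) rules, tried in order. These are illustrative
# assumptions, not the rule set of any real stemmer.
SUFFIX_RULES = [
    ("ies", "i"),
    ("sses", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)] + replacement
        if len(stem) < min_stem_len:
            continue  # the remaining stem would be too short, try the next rule
        # After removing "ing" or "ed", collapse a doubled final consonant
        # ("runn" becomes "run"), except for l, s and z ("fall" stays "fall").
        if (
            suffix in ("ing", "ed")
            and stem[-1] == stem[-2]
            and stem[-1] not in "aeioulsz"
        ):
            stem = stem[:-1]
        return stem
    return word  # no rule applied, e.g. the irregular form "ran"


for w in ["running", "runs", "ran", "foxes", "ponies"]:
    print(f"{w}: {toy_stem(w)}")
```

Because the rules only look at word endings, "ran" passes through untouched and "ponies" becomes the non-word "poni", which is the kind of behavior to expect from any purely suffix-based stemmer.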

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand.
stemmer = PorterStemmer()

sentence = "The quick brown foxes are jumping over the lazy dogs"
words = word_tokenize(sentence)

# Print each token next to its Porter stem.
for word in words:
    print(f"{word}: {stemmer.stem(word)}")
```

```
The: the
quick: quick
brown: brown
foxes: fox
are: are
jumping: jump
over: over
the: the
lazy: lazi
dogs: dog
```


### Lemmatization:
Lemmatization takes a more linguistic approach. It uses a vocabulary and morphological analysis, often together with the word's part of speech, to map each inflected form to its canonical dictionary form, the lemma. Unlike a stem, a lemma is normally a valid word, although the result depends on the quality of the lexicon and on correct part-of-speech tagging. For instance, "changing", "changed", and "changes" all lemmatize to "change".
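The dictionary-lookup idea can be sketched with a tiny hand-written table. Everything here is a hypothetical illustration: `LEMMA_TABLE`, `toy_lemmatize`, and the part-of-speech labels are made up for this example, and real lemmatizers rely on far larger lexicons plus morphological rules.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("are", "VERB"): "be",
    ("ran", "VERB"): "run",
    ("changing", "VERB"): "change",
    ("changed", "VERB"): "change",
    ("changes", "VERB"): "change",
    ("changes", "NOUN"): "change",
    ("foxes", "NOUN"): "fox",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up (word, pos) in the table; fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


examples = [
    ("are", "VERB"),
    ("ran", "VERB"),
    ("changes", "NOUN"),
    ("meeting", "VERB"),
    ("meeting", "NOUN"),
]
for word, pos in examples:
    print(f"{word} ({pos}): {toy_lemmatize(word, pos)}")
```

The two "meeting" entries show why part of speech matters: the same surface form maps to "meet" as a verb and stays "meeting" as a noun, a distinction that suffix stripping alone cannot make.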

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "The quick brown foxes are jumping over the lazy dogs"
doc = nlp(text)

# Print each token next to its lemma.
for token in doc:
    print(f"{token.text}: {token.lemma_}")
```

```
The: the
quick: quick
brown: brown
foxes: fox
are: be
jumping: jump
over: over
the: the
lazy: lazy
dogs: dog
```
