## Stemming and Lemmatization

__What is Stemming?__
Stemming is a method for reducing words to a root form, which is achieved by crudely lopping off the ends of words.  Because of this crude heuristic, the resulting stems won't necessarily be "real words".  
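
To make "lopping off the ends of words" concrete, here is a minimal sketch of a suffix-stripping stemmer. The `SUFFIXES` rule list and the `crude_stem` helper are hypothetical names made up for illustration; this is not the Porter algorithm or any library's implementation.

```python
# Toy suffix-stripping stemmer (illustrative only; not the Porter algorithm).
SUFFIXES = ("ing", "es", "s")  # hypothetical rule list, checked in this order


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["foots", "footing", "flies", "organizing", "feet"]
stems = [crude_stem(w) for w in words]
print(stems)  # ['foot', 'foot', 'fli', 'organiz', 'feet']
```

Note that `fli` and `organiz` are not real words, and the irregular plural `feet` is left untouched because no suffix rule applies to it.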

__What is Lemmatization?__ 
In contrast, lemmatization reduces words to their root forms based on linguistic rules.  As a consequence, lemmas are always actual words.  
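
As a rough illustration of the idea, a lemmatizer can be thought of as a lookup from inflected forms to dictionary forms. The sketch below assumes a tiny hand-written table; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and real lemmatizers rely on much larger dictionaries together with morphological rules and part-of-speech information.

```python
# Toy lookup-based lemmatizer (illustrative only; the table is hand-written).
LEMMA_TABLE = {  # hypothetical entries mapping inflected form -> lemma
    "feet": "foot",
    "foots": "foot",
    "flies": "fly",
    "organizes": "organize",
}


def toy_lemmatize(word: str) -> str:
    """Return the listed lemma, or the word itself if it is not in the table."""
    return LEMMA_TABLE.get(word.lower(), word)


lemmas = [toy_lemmatize(w) for w in ["feet", "foots", "flies", "universe"]]
print(lemmas)  # ['foot', 'foot', 'fly', 'universe']
```

Every output is a real word, and the irregular plural `feet` is handled correctly, but only because the table happens to contain it. Words missing from the table pass through unchanged.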

__Stemming vs Lemmatization__ 
Stemming is faster to run on text data than lemmatization, but is considered cruder.  Both stemming and lemmatization are implementation dependent: different libraries, and different algorithms within the same library, can return different results for the same word.  Library support also varies.  NLTK offers both stemmers and a lemmatizer, while SpaCy supports only lemmatization and not stemming, since in its opinion lemmatization is far superior.    

__Should we always do it?__ 
Just like stop word removal, stemming and lemmatization can be helpful with simpler, more traditional NLP methods, but both remove information from the text.  In deep learning applications, this can hurt model performance, so you should think carefully before doing it; it is no longer a clear standard practice.  

The same can be said for other simplification methods, like removing punctuation and lowercasing. 
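
A small sketch of what gets lost: the hypothetical `simplify` helper below lowercases text and deletes ASCII punctuation using only the standard library. The example sentences are made up for illustration.

```python
import string


def simplify(text: str) -> str:
    """Lowercase the text and delete ASCII punctuation characters."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


a = simplify("Let's eat, Grandma!")
b = simplify("Let's eat Grandma!")
c = simplify("The US team joined us.")

print(a)  # lets eat grandma
print(b)  # lets eat grandma
print(c)  # the us team joined us
```

After simplification, the two "Grandma" sentences are indistinguishable, and the country name "US" collapses into the pronoun "us".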

### Stemming and Lemmatizing with NLTK

In [1]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/syu/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [3]:
from nltk import stem
wnl = stem.WordNetLemmatizer()
porter = stem.porter.PorterStemmer()

In [4]:
word_list = ['feet', 'foot', 'foots', 'footing']

Lemmatizing: 

In [5]:
[wnl.lemmatize(word) for word in word_list]

['foot', 'foot', 'foot', 'footing']

Stemming:

In [6]:
[porter.stem(word) for word in word_list]

['feet', 'foot', 'foot', 'foot']

### Lemmatizing with SpaCy

In [8]:
import spacy

In [12]:
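# Note: this import path is the spaCy v2 API; spacy.lemmatizer was removed in spaCy v3.
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
# An empty Lookups object holds no lemma tables, so lookup() below
# returns each word unchanged.
lookups = Lookups()
lemmatizer = Lemmatizer(lookups)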

In [13]:
[lemmatizer.lookup(word) for word in word_list]

['feet', 'foot', 'foots', 'footing']

### Exercises

Lemmatize and Stem: fly, flies, flying

In [14]:
word_list = ['fly', 'flies', 'flying']

In [19]:
# Lemmatize, NLTK (words are treated as nouns by default, pos='n'; pass pos='v' for verb forms)
[wnl.lemmatize(word) for word in word_list]

['fly', 'fly', 'flying']

In [20]:
# Lemmatize, SpaCy
[lemmatizer.lookup(word) for word in word_list]

['fly', 'flies', 'flying']

In [21]:
# Stemming, NLTK
[porter.stem(word) for word in word_list]

['fli', 'fli', 'fli']

Lemmatize and Stem: organize, organizes, organizing

In [22]:
word_list = ['organize', 'organizes', 'organizing']

In [23]:
# Lemmatize, NLTK (default pos='n' treats these verbs as nouns, so they come back unchanged)
[wnl.lemmatize(word) for word in word_list]

['organize', 'organizes', 'organizing']

In [24]:
# Lemmatize, SpaCy 
[lemmatizer.lookup(word) for word in word_list]

['organize', 'organizes', 'organizing']

In [25]:
# Stemming, NLTK
[porter.stem(word) for word in word_list]

['organ', 'organ', 'organ']

Lemmatize and Stem: universe, university

In [28]:
word_list = ['universe', 'university']

In [29]:
# Lemmatize, NLTK
[wnl.lemmatize(word) for word in word_list]

['universe', 'university']

In [30]:
# Lemmatize, SpaCy
[lemmatizer.lookup(word) for word in word_list]

['universe', 'university']

In [31]:
# Stemming, NLTK
[porter.stem(word) for word in word_list]

['univers', 'univers']