## NLP with Python and Machine Learning
### Chapter 3: Stemming and Lemmatizing

* Stemming reduces a word to its root form, e.g., reducing "stemming" to "stem".
NLTK includes several stemmers, such as Porter, Snowball, Lancaster, and a regex-based stemmer; we'll use the Porter stemmer.

* Lemmatizing groups together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma (its dictionary form).

Stemming and lemmatizing both reduce a word to a root form, but they are implemented differently. Stemming usually chops letters off the end of a word by rule, so it's simple but may not return a real word. Lemmatizing relies on a vocabulary (WordNet, here) and can take the word's part of speech into account, so it returns a dictionary word whenever it finds a match; NLTK's WordNet lemmatizer returns the input unchanged when it finds none.
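To make the contrast concrete, here is a minimal pure-Python sketch. It is a toy illustration only: `toy_stem`, `toy_lemmatize`, the `SUFFIXES` tuple, and the `LEMMAS` dictionary are all hypothetical names invented for this example, and the logic is neither the Porter algorithm nor WordNet.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ness", "ing", "es", "s")  # hypothetical suffix list, checked in order


def toy_stem(word):
    """Chop the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-written lemma dictionary standing in for WordNet.
LEMMAS = {"geese": "goose", "grows": "grow", "running": "run"}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMAS.get(word, word)


for w in ["grows", "running", "meanness", "geese"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based function turns "running" into the non-word "runn" and cannot connect "geese" to "goose", while the lookup handles both but leaves "meanness" untouched because it has no entry for it.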

[Tutorial link here](https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/introducing-stemming?autoAdvance=true&autoSkip=true&autoplay=true&resume=false)

In [68]:
import numpy as np
import pandas as pd
import re
import pdb
import nltk
nltk.download()  # opens the interactive downloader; WordNetLemmatizer needs the 'wordnet' corpus

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [69]:
# Create one stemmer and one lemmatizer, reused in the cells below
ps = nltk.PorterStemmer()
wn = nltk.WordNetLemmatizer()

In [70]:
print(ps.stem('grows'))
print(ps.stem('growing'))
print(ps.stem('grow'))

print(ps.stem('runs'))
print(ps.stem('running'))
print(ps.stem('runn'))

grow
grow
grow
run
run
runn
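Because the inflected forms above collapse to a shared root, they can be counted as a single term. The sketch below shows that effect on vocabulary size using only the standard library. The token lists and the `ROOTS` lookup are made up for illustration (they are not rows from the spam dataset), and `normalize` is a hypothetical stand-in for a real stemmer or lemmatizer.

```python
from collections import Counter

# Hypothetical tokenized messages (not rows from the spam dataset).
messages = [
    ["grow", "grows", "growing"],
    ["run", "runs", "running"],
]

# Hypothetical lookup standing in for a stemmer or lemmatizer.
ROOTS = {"grows": "grow", "growing": "grow", "runs": "run", "running": "run"}


def normalize(tokens):
    """Map each token to its root, leaving unknown tokens unchanged."""
    return [ROOTS.get(tok, tok) for tok in tokens]


# Count terms before and after mapping tokens to their roots.
raw_counts = Counter(tok for msg in messages for tok in msg)
root_counts = Counter(tok for msg in messages for tok in normalize(msg))

print(len(raw_counts), "distinct raw tokens")
print(len(root_counts), "distinct root tokens")
print(dict(root_counts))
```

Six distinct surface forms shrink to two terms, each counted three times, which is the kind of grouping a downstream model benefits from.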


In [71]:
print(ps.stem('meanness'))
print(ps.stem('meaning'))

print(wn.lemmatize('meanness'))
print(wn.lemmatize('meaning'))

mean
mean
meanness
meaning


In [73]:
print('===== Stemmer =======')
print(ps.stem('goose'))
print(ps.stem('geese'))

print('===== Lemmatizer ======')
print(wn.lemmatize('goose'))
print(wn.lemmatize('geese'))

===== Stemmer =======
goos
gees
===== Lemmatizer ======
goose
goose


In [63]:
dataset = pd.read_pickle('Spam_ch2.pkl')
dataset.head(5)

| | label | text | text_clean | remove_stop |
|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to tha... | [Ive, been, searching, for, the, right, words,... | [Ive, searching, right, words, thank, breather... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | [Free, entry, in, 2, a, wkly, comp, to, win, F... | [Free, entry, 2, wkly, comp, win, FA, Cup, fin... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... | [Nah, I, dont, think, he, goes, to, usf, he, l... | [Nah, I, dont, think, goes, usf, lives, around... |
| 3 | ham | Even my brother is not like to speak with me. ... | [Even, my, brother, is, not, like, to, speak, ... | [Even, brother, like, speak, They, treat, like... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL] | [I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL] |


In [74]:
def stemming(tokenized_text):
    # Stem every token with the Porter stemmer
    return [ps.stem(word) for word in tokenized_text]

def lemmatize(tokenized_text):
    # Lemmatize every token with WordNet (default part of speech: noun)
    return [wn.lemmatize(word) for word in tokenized_text]

# Apply both functions to the stop-word-free token lists
dataset['stemmed_data'] = dataset['remove_stop'].apply(stemming)
dataset['lemmatized_data'] = dataset['remove_stop'].apply(lemmatize)
dataset.head()

| | label | text | text_clean | remove_stop | stemmed_data | lemmatized_data |
|---|---|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to tha... | [Ive, been, searching, for, the, right, words,... | [Ive, searching, right, words, thank, breather... | [ive, search, right, word, thank, breather, I,... | [Ive, searching, right, word, thank, breather,... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | [Free, entry, in, 2, a, wkly, comp, to, win, F... | [Free, entry, 2, wkly, comp, win, FA, Cup, fin... | [free, entri, 2, wkli, comp, win, FA, cup, fin... | [Free, entry, 2, wkly, comp, win, FA, Cup, fin... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... | [Nah, I, dont, think, he, goes, to, usf, he, l... | [Nah, I, dont, think, goes, usf, lives, around... | [nah, I, dont, think, goe, usf, live, around, ... | [Nah, I, dont, think, go, usf, life, around, t... |
| 3 | ham | Even my brother is not like to speak with me. ... | [Even, my, brother, is, not, like, to, speak, ... | [Even, brother, like, speak, They, treat, like... | [even, brother, like, speak, they, treat, like... | [Even, brother, like, speak, They, treat, like... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL] | [I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL] | [I, have, A, date, ON, sunday, with, will] | [I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL] |


In [75]:
dataset.to_pickle('Spam_stem_lemma.pkl')