## **Supplemental Data Cleaning**
> ### ***Stemming***
> Process of reducing inflected (or sometimes derived) words to their word stem or root.<br>
> OR<br>
> Crudely chopping off the end of the word to leave only the base. (relate this to a flower/tree)<br>
> This **means** *taking words with various suffixes and condensing them under the same root word.*<br>
> This ***helps in reducing the number of words/tokens.***<br>

>|Words|After Stemming|
>|:-----:|:--------------:|
>|Stemming/Stemmed|***Stem***|
>|Electricity/Electrical|***Electr***|
>|Berries/Berry|***Berri***|
>|Connection/Connected/Connective|***Connect***|

> Stemming is very useful, but ***it relies on very crude rules, so it isn't perfect.***<br>
> e.g. `Meanness/Meaning -> Mean`  
> ***[Even though the two words aren't closely related, both are reduced to the same stem.]***
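
> A minimal sketch of what "crude rules" means in practice. The suffix list and the `crude_stem` helper are made up for illustration; this is not the Porter algorithm, only a toy suffix stripper that shows how unrelated words can end up under one stem.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration
# of rule-based suffix chopping.
SUFFIXES = ("ness", "ing", "ed", "s")  # hypothetical rule list, checked in order


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["grows", "growing", "meanness", "meaning"]:
    print(w, "->", crude_stem(w))
```

> The rules never look at meaning, so `meanness` and `meaning` land on the same stem just as `grows` and `growing` do.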

> **Benefits of Stemming:**
> - Reduces the corpus of words the model is exposed to
> - Explicitly correlates words with similar meanings
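
> The first benefit can be made concrete by counting distinct tokens before and after stemming. The sketch below assumes a hand-written `KNOWN_STEMS` table (copied from the Porter outputs shown in this notebook) instead of calling a real stemmer, and the `vocabulary` helper is hypothetical.

```python
# Stem lookup copied from outputs shown in this notebook; a real pipeline
# would call a stemmer instead of using a hand-written table.
KNOWN_STEMS = {
    "grows": "grow",
    "growing": "grow",
    "running": "run",
    "meanness": "mean",
    "meaning": "mean",
}


def vocabulary(tokens, normalize=lambda t: t):
    """Return the set of distinct tokens after applying `normalize`."""
    return {normalize(t) for t in tokens}


tokens = ["grow", "grows", "growing", "run", "running", "meanness", "meaning"]
raw_vocab = vocabulary(tokens)
stemmed_vocab = vocabulary(tokens, lambda t: KNOWN_STEMS.get(t, t))

print(len(raw_vocab), "distinct tokens before stemming")
print(len(stemmed_vocab), "distinct tokens after stemming:", sorted(stemmed_vocab))
```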

> ***What are some stemmers?***
> - Porter Stemmer
> - Snowball Stemmer
> - Lancaster Stemmer
> - Regex-based Stemmer
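
> A rough side-by-side of the four stemmers as exposed by `nltk.stem`. The word list, the regular expression, and the `min=4` setting passed to `RegexpStemmer` are illustrative assumptions, not recommended values.

```python
from nltk.stem import (
    LancasterStemmer,
    PorterStemmer,
    RegexpStemmer,
    SnowballStemmer,
)

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
    # The suffix pattern and minimum word length are illustrative assumptions.
    "regexp": RegexpStemmer(r"ing$|ed$|s$", min=4),
}

words = ["running", "connected", "berries"]

# Stem every word with every stemmer so the outputs can be compared.
results = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}
for name, stems in results.items():
    print(f"{name:10s} {stems}")
```

> The regex-based stemmer only removes whatever the pattern matches, so it is the crudest of the four; the others apply multi-step rule sets.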

> **Porter Stemmer**<br>
> Use `nltk.PorterStemmer()`

In [1]:
import nltk

ps = nltk.PorterStemmer()

In [2]:
# methods and attributes in PorterStemmer
dir(ps)

['MARTIN_EXTENSIONS',
 'NLTK_EXTENSIONS',
 'ORIGINAL_ALGORITHM',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_apply_rule_list',
 '_contains_vowel',
 '_ends_cvc',
 '_ends_double_consonant',
 '_has_positive_measure',
 '_is_consonant',
 '_measure',
 '_replace_suffix',
 '_step1a',
 '_step1b',
 '_step1c',
 '_step2',
 '_step3',
 '_step4',
 '_step5a',
 '_step5b',
 'mode',
 'pool',
 'stem',
 'vowels']

In [5]:
# example
print(ps.stem('grows'))
print(ps.stem('growing'))
print(ps.stem('grow'))

# all three forms now reduce to the same token

grow
grow
grow


In [10]:
# example
print(ps.stem('run'))
print(ps.stem('running'))
print(ps.stem('runner'))

# 'run' and 'running' are actions, whereas 'runner' is a person;
# the stemmer reduces the first two to 'run' and keeps 'runner' separate, so it does a good job here

print(ps.stem('meanness'))
print(ps.stem('meaning'))
# but these two unrelated words both collapse to 'mean', so stemmers aren't always reliable

run
run
runner
mean
mean


In [11]:
# Read in raw text
import pandas as pd
import re
import string
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text_body']

In [12]:
# Clean up the data

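def clean_text(text):
    # drop punctuation (iterating over a string yields single characters)
    text = ''.join([char for char in text if char not in string.punctuation])
    # split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # remove English stopwords
    text = [word for word in tokens if word not in stopwords]
    return text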

data['text_body_nostop'] = data['text_body'].apply(lambda x: clean_text(x.lower()))
data.head()

Unnamed: 0,label,text_body,text_body_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]"
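
> To see why the last row shrinks to `[date, sunday]`, here is a step-by-step sketch of the same cleaning logic on a single message. It assumes a tiny stand-in `STOPWORDS` set rather than NLTK's full English list, and the `clean_text_steps` helper is hypothetical; it also drops empty tokens, which the notebook's `clean_text` does not do.

```python
import re
import string

# Tiny stand-in stopword set (assumption); the notebook uses NLTK's English list.
STOPWORDS = {"i", "have", "a", "on", "with", "will", "to", "he", "the"}


def clean_text_steps(text):
    """Return the intermediate results of each cleaning step."""
    lowered = text.lower()
    # 1. remove punctuation characters
    no_punct = "".join(ch for ch in lowered if ch not in string.punctuation)
    # 2. tokenize on runs of non-word characters
    tokens = re.split(r"\W+", no_punct)
    # 3. drop empty strings and stopwords
    kept = [t for t in tokens if t and t not in STOPWORDS]
    return no_punct, tokens, kept


no_punct, tokens, kept = clean_text_steps("I HAVE A DATE ON SUNDAY WITH WILL!!")
print(no_punct)
print(tokens)
print(kept)
```

> Note that the name "Will" disappears because, once lowercased, it is indistinguishable from the stopword `will`; likewise `I've` becomes `ive` once the apostrophe is stripped.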


In [13]:
# Stem text

# function to stem
def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text]
    return text

data['text_body_stemmed'] = data['text_body_nostop'].apply(lambda x: stemming(x))
data.head()

Unnamed: 0,label,text_body,text_body_nostop,text_body_stemmed
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"


> Thus, stemming is not always a great way to reduce the corpus, so we turn to lemmatizing.

> ### ***Lemmatizing***
> Process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's **lemma**. [*Lemma*: the canonical (dictionary) form shared by a set of word forms]<br>
> e.g. *type, typed, typing* are all forms of the same lemma<br>
> It uses vocabulary analysis of words, aiming to remove inflectional endings and return the dictionary form of a word.<br>
> For words in its vocabulary, lemmatizing ***always returns a dictionary word.***<br>
> <hr>

> **How is lemmatizing different from stemming?**<br>

>|Stemming|Lemmatizing|
>|:--------:|:-----------:|
>|Stemming is typically **faster** as it simply **chops off the end of words** using heuristics, **without any understanding of the context** in which a word is used|Leverages more **informed analysis** to create **groups of words with similar meaning**, based on the **context around the word, part of speech, or other factors.**|
>|May or **may not** return an actual ***word in the dictionary***|**Always** returns a ***dictionary word***|
>|Faster|Comparatively Slower|
>|Less Accurate|Very Accurate|
>|Lower computational cost|Computationally Expensive|
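
> A toy sketch of the "part of speech" point in the table. The `LEMMAS` table and `toy_lemmatize` helper are hypothetical stand-ins for WordNet, written only to show how a lookup keyed on (word, part of speech) behaves. NLTK's `WordNetLemmatizer.lemmatize` takes a similar `pos` argument that defaults to noun.

```python
# Hand-written (word, part-of-speech) -> lemma table; a stand-in for WordNet.
LEMMAS = {
    ("geese", "n"): "goose",
    ("lives", "n"): "life",
    ("lives", "v"): "live",
    ("typed", "v"): "type",
    ("typing", "v"): "type",
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("geese"))        # noun lookup
print(toy_lemmatize("lives", "n"))   # treated as the plural of a noun
print(toy_lemmatize("lives", "v"))   # treated as a verb form
print(toy_lemmatize("wkly"))         # unknown word comes back unchanged
```

> The same surface form can map to different lemmas depending on the part of speech, and words missing from the table pass through untouched.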

><hr>

> ***Using a Lemmatizer***<br>
> Test [**WordNet lemmatizer**](https://wordnet.princeton.edu/)<br>
> ***WordNet*** is *a collection of nouns, verbs, adjectives, adverbs that are grouped together in **sets of synonyms**, each expressing a distinct concept*.<br>
> Use `nltk.WordNetLemmatizer()`

In [14]:
import nltk

wn = nltk.WordNetLemmatizer()       #lemmatize using WordNetLemmatizer
ps = nltk.PorterStemmer()           #Stem using PorterStemmer

In [15]:
dir(wn)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'lemmatize']

In [17]:
# example
print(ps.stem('meanness'))
print(ps.stem('meaning'))
# stemming incorrectly collapses both words to 'mean'

print(wn.lemmatize('meanness'))
print(wn.lemmatize('meaning'))
# lemmatizing keeps them apart: each is its own dictionary word, not an inflected form of a shared lemma

mean
mean
meanness
meaning


In [18]:
# example
print(ps.stem('goose'))
print(ps.stem('geese'))

print(wn.lemmatize('goose'))
print(wn.lemmatize('geese'))

# This makes it clear: lemmatizing returns a correct dictionary word, while stemming returns non-words ('goos', 'gees')

goos
gees
goose
goose


In [9]:
# Read in raw text
import pandas as pd
import re
import string
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text_body']



In [20]:
# Clean up the data

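def clean_text(text):
    # drop punctuation (iterating over a string yields single characters)
    text = ''.join([char for char in text if char not in string.punctuation])
    # split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # remove English stopwords
    text = [word for word in tokens if word not in stopwords]
    return text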

data['text_body_nostop'] = data['text_body'].apply(lambda x: clean_text(x.lower()))
data.head()

Unnamed: 0,label,text_body,text_body_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]"


> **Lemmatize text**

In [22]:
# Lemmatizing using WordNetLemmatizer

# function for lemmatizing
def lemmatizing(tokenized_text):
    text = [wn.lemmatize(word) for word in tokenized_text]
    return text

# apply this function to all the rows in the 'text_body_nostop' column
data['text_body_lemmatized'] = data['text_body_nostop'].apply(lambda row: lemmatizing(row))

data.head()

Unnamed: 0,label,text_body,text_body_nostop,text_body_lemmatized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, searching, right, word, thank, breather, promise, wont, take, help, granted, fulfil, promi..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, go, usf, life, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"


> Just like a stemmer, a lemmatizer ***won't do particularly well with slang or abbreviations***, so it is not a great fit for this dataset. It would work much better on a *collection of book reports or journal articles.*