# Natural Language Processing (NLP) Prerequisites
___
## Guided Assignment

Before we even start thinking about machine learning models, we typically have to perform these data pre-processing/cleaning steps first:

> [0. Load and Investigate Data](#Load-and-Investigate-Data)
>
> [1. Remove Punctuation](#Remove-Punctuation)
>
> [2. Tokenize](#Tokenize)
>
> [3. Remove Stopwords](#Remove-Stopwords)
>
> [4.1 Stem](#Stem)
>
> [4.2 Lemmatize](#Lemmatize)

We refer to this set of steps as our [**pre-processing pipeline**](#Pre-Processing-Pipeline).

Below, we guide you through each of these steps.

> ### Load and Investigate Data

In [8]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label','text']

data.head()

Unnamed: 0,label,text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [44]:
print('Our corpus has {} samples with {} features.'.format(len(data),len(data.columns)))
print('There are {} samples labeled ham and {} samples labeled spam.'.format((data['label']=='ham').sum(),
                                                                             (data['label']=='spam').sum()))
print('# of missing labels: {}'.format(data['label'].isnull().sum()))
print('# of missing sentences: {}'.format(data['text'].isnull().sum()))

Our corpus has 5568 samples with 7 features.
There are 4822 samples labeled ham and 746 samples labeled spam.
# of missing labels: 0
# of missing sentences: 0


> ### Remove Punctuation

In [9]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Create a function named `remove_punctuation` that takes a string, drops every character that appears in `string.punctuation`, joins the remaining characters back together (remember that we break the string down into its characters to check each one for punctuation), and returns the punctuation-free string.

Then apply this function to the `text` column of our pandas DataFrame using the `.apply()` method with a `lambda` function.

Template code has been given to you. It is your job to complete it. 

In [12]:
def remove_punctuation(text):
    cleaned_text = ''.join([char for char in text if char not in string.punctuation])
    return cleaned_text

data['cleaned'] = data['text'].apply(lambda x: remove_punctuation(x))
data.head()

Unnamed: 0,label,text,cleaned
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL
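
The character-by-character approach above is easy to read. As a possible alternative, the sketch below does the same job with `str.translate`. The function name `remove_punctuation_translate` and the table `_PUNCT_TABLE` are made up for this illustration and are not part of the assignment template.

```python
import string

# Translation table that maps every punctuation character to None (deletes it).
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_translate(text):
    """Hypothetical alternative to remove_punctuation, built on str.translate."""
    return text.translate(_PUNCT_TABLE)

# Sample message taken from row 2 of the data shown above.
sample = "Nah I don't think he goes to usf, he lives around here though"
print(remove_punctuation_translate(sample))
# prints: Nah I dont think he goes to usf he lives around here though
```

It could be applied the same way as before, for example `data['text'].apply(remove_punctuation_translate)`.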


> ### Tokenize

In [15]:
import re

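def tokenize(cleaned_text):
    # Split on runs of non-word characters; the raw string keeps '\W' from being read as a string escape
    tokenized_text = re.split(r'\W+', cleaned_text)
    return tokenized_text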

data['tokenized'] = data['cleaned'].apply(lambda x: tokenize(x.lower()))
data.head()

Unnamed: 0,label,text,cleaned,tokenized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]"
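
One thing to watch for: `re.split` returns an empty string wherever the text starts or ends with a separator, which can happen when a message has leading or trailing whitespace after punctuation removal. The sketch below shows the effect on a padded string and one way to filter the empty tokens out. `tokenize_nonempty` is a hypothetical helper, not part of the template.

```python
import re

def tokenize_nonempty(cleaned_text):
    """Hypothetical variant of tokenize that drops empty tokens."""
    return [token for token in re.split(r'\W+', cleaned_text) if token]

# A lowercased message with leading and trailing whitespace.
padded = ' i have a date on sunday with will '

print(re.split(r'\W+', padded))
# prints: ['', 'i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will', '']

print(tokenize_nonempty(padded))
# prints: ['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']
```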


> ### Remove Stopwords

In [17]:
import nltk
# Needs the NLTK stopword corpus; run nltk.download('stopwords') once if it is missing
stopwords = nltk.corpus.stopwords.words('english')

def remove_stopwords(tokenized_text):
    no_stopword_text = [word for word in tokenized_text if word not in stopwords]
    return no_stopword_text

data['no stopwords'] = data['tokenized'].apply(lambda x: remove_stopwords(x))
data.head()

Unnamed: 0,label,text,cleaned,tokenized,no stopwords
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]"
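
`nltk.corpus.stopwords.words('english')` returns a list, so every `word not in stopwords` check scans that list. On a larger corpus it may be worth converting the list to a `set` once. The sketch below assumes a tiny hand-picked stopword list (`demo_stopword_list`, an illustrative subset and not NLTK's full list) so that it runs without any downloads. `remove_stopwords_set` is a hypothetical name.

```python
# Illustrative subset only; the notebook itself uses NLTK's full English list.
demo_stopword_list = ['i', 'he', 'to', 'here', 'have', 'a', 'on', 'with', 'will']
demo_stopword_set = set(demo_stopword_list)  # constant-time membership checks

def remove_stopwords_set(tokenized_text, stopword_set):
    """Keep only the tokens that are not in stopword_set."""
    return [word for word in tokenized_text if word not in stopword_set]

# Tokens from row 2 of the table above.
tokens = ['nah', 'i', 'dont', 'think', 'he', 'goes', 'to', 'usf',
          'he', 'lives', 'around', 'here', 'though']
print(remove_stopwords_set(tokens, demo_stopword_set))
# prints: ['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though']
```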


> ### Stem

In [20]:
ps = nltk.PorterStemmer()

def stem(no_stopword_text):
    stemmed_text = [ps.stem(word) for word in no_stopword_text]
    return stemmed_text

data['stemmed'] = data['no stopwords'].apply(lambda x: stem(x))
data[['label','text','no stopwords','stemmed']].head()

Unnamed: 0,label,text,no stopwords,stemmed
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"


> ### Lemmatize

In [22]:
# Needs the WordNet corpus; run nltk.download('wordnet') once if it is missing
wn = nltk.WordNetLemmatizer()

def lemmatize(no_stopword_text):
    lemmatized_text = [wn.lemmatize(word) for word in no_stopword_text]
    return lemmatized_text

data['lemmatized'] = data['no stopwords'].apply(lambda x: lemmatize(x))
data[['label','text','no stopwords','lemmatized']].head()

Unnamed: 0,label,text,no stopwords,lemmatized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, searching, right, word, thank, breather, promise, wont, take, help, granted, fulfil, promi..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, go, usf, life, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"


> ### Pre-Processing Pipeline

In [47]:
import pandas as pd
import string
import re
import nltk

pipeline_data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
pipeline_data.columns = ['label','text']

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()
wn = nltk.WordNetLemmatizer()

def preprocessing_pipeline(text, process):
    # Remove punctuation, then split on runs of non-word characters
    cleaned_text = ''.join([char for char in text if char not in string.punctuation])
    tokenized_text = re.split(r'\W+', cleaned_text)
    # Note: the text is not lowercased before the stopword check, so capitalized
    # stopwords such as 'I' or 'They' are not removed (see the output below)
    if process == 'stem':
        stemmed_text = [ps.stem(word) for word in tokenized_text if word not in stopwords]
        return stemmed_text
    elif process == 'lem':
        lemmatized_text = [wn.lemmatize(word) for word in tokenized_text if word not in stopwords]
        return lemmatized_text
    raise ValueError("process must be 'stem' or 'lem'")

pipeline_data['processed (stem)'] = pipeline_data['text'].apply(lambda x: preprocessing_pipeline(x, 'stem'))
pipeline_data['processed (lem)'] = pipeline_data['text'].apply(lambda x: preprocessing_pipeline(x, 'lem'))
pipeline_data.head(10)

Unnamed: 0,label,text,processed (stem),processed (lem)
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, search, right, word, thank, breather, i, promis, wont, take, help, grant, fulfil, promis, ...","[Ive, searching, right, word, thank, breather, I, promise, wont, take, help, granted, fulfil, pr..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...","[Free, entry, 2, wkly, comp, win, FA, Cup, final, tkts, 21st, May, 2005, Text, FA, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, i, dont, think, goe, usf, live, around, though]","[Nah, I, dont, think, go, usf, life, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, they, treat, like, aid, patent]","[Even, brother, like, speak, They, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[i, have, a, date, on, sunday, with, will]","[I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL]"
5,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...,"[as, per, request, mell, mell, oru, minnaminungint, nurungu, vettam, set, callertun, caller, pre...","[As, per, request, Melle, Melle, Oru, Minnaminunginte, Nurungu, Vettam, set, callertune, Callers..."
6,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c...,"[winner, as, valu, network, custom, select, receivea, 900, prize, reward, to, claim, call, 09061...","[WINNER, As, valued, network, customer, selected, receivea, 900, prize, reward, To, claim, call,..."
7,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came...,"[had, mobil, 11, month, u, r, entitl, updat, latest, colour, mobil, camera, free, call, the, mob...","[Had, mobile, 11, month, U, R, entitled, Update, latest, colour, mobile, camera, Free, Call, The..."
8,ham,"I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried ...","[im, gonna, home, soon, dont, want, talk, stuff, anymor, tonight, k, ive, cri, enough, today]","[Im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, Ive, cried, enough, today]"
9,spam,"SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, ...","[six, chanc, win, cash, from, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6day, 1...","[SIX, chance, win, CASH, From, 100, 20000, pound, txt, CSH11, send, 87575, Cost, 150pday, 6days,..."
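
Notice that row 4 in the pipeline output keeps every token, while the step-by-step version reduced it to `[date, sunday]`. The step-by-step cells lowercased the text before tokenizing, and the stopword list is lowercase. The sketch below is one way the pipeline might lowercase first. It is a simplified stand-in under stated assumptions: `preprocess`, `DEMO_STOPWORDS` (a tiny illustrative stopword set), and the pluggable `normalize` argument are all hypothetical, and the default `normalize` leaves tokens unchanged instead of stemming or lemmatizing them.

```python
import re
import string

# Illustrative subset only; swap in NLTK's English stopword list in the notebook.
DEMO_STOPWORDS = frozenset({'i', 'have', 'a', 'on', 'with', 'will'})

def preprocess(text, stopwords, normalize=lambda word: word):
    """Lowercase, strip punctuation, tokenize, drop stopwords, then normalize."""
    lowered = text.lower()
    cleaned = ''.join(char for char in lowered if char not in string.punctuation)
    tokens = [token for token in re.split(r'\W+', cleaned) if token]
    return [normalize(token) for token in tokens if token not in stopwords]

print(preprocess('I HAVE A DATE ON SUNDAY WITH WILL!!', DEMO_STOPWORDS))
# prints: ['date', 'sunday']
```

In the notebook, `normalize` could be `ps.stem` or `wn.lemmatize`, and `stopwords` could be the NLTK list loaded earlier.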
