# 3 NLP Basics
## 3.1 Natural Language Toolkit
### Install NLTK
Installation instructions can be found [here](https://www.nltk.org/install.html).
#### Download NLTK data

In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Two useful functions in `dir(nltk)`:
* `pos_tag`: part-of-speech tagging
* `tokenize`: take a sentence and split it into a list of words

### Quick look at some stop words in NLTK
Stop words are words that are used very frequently but contribute little to the meaning of a sentence. In sentiment analysis, these words are mostly sentiment-neutral, so we can usually drop them.

Stop word lists are available for several languages; here we print the first 10 English stop words.

In [2]:
from nltk.corpus import stopwords

stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

There are a lot of pronouns above. Let's look further down the list.

In [3]:
stopwords.words('english')[0:500:25]

['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']

## 3.2 Read in & explore text data
### Read in semi-structured text data

In [4]:
# Read in the raw text (the context manager closes the file afterwards)
with open('SMSSpamCollection') as f:
    rawData = f.read()

# Print the raw data
rawData[0:200]

'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\nham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp to win FA Cup fin'

In [5]:
parsedData = rawData.replace('\t', '\n').split('\n')

parsedData[0:4]

['ham',
 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'ham',
 'Ok lar... Joking wif u oni...']

In [6]:
labelList = parsedData[0::2]
textList = parsedData[1::2]

print(labelList[0:5])
print(textList[0:5])

['ham', 'ham', 'spam', 'ham', 'ham']
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...', "Nah I don't think he goes to usf, he lives around here though"]


In [7]:
# Check length in each list
print(len(labelList))
print(len(textList))

5575
5574


In [8]:
# Check last few elements
print(labelList[-5:])

['ham', 'ham', 'ham', 'ham', '']
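
The extra empty string at the end of `labelList` comes from the file's trailing newline: splitting on `'\n'` leaves one empty element after the last message. Here is a minimal sketch of the effect on a small in-memory sample, together with a line-by-line parse that avoids the mismatch. The `sample` string and the `parse_lines` helper are illustrative and not part of the notebook.

```python
# Illustrative sample in the same "label<TAB>text<NEWLINE>" layout as the file
sample = 'ham\tOk lar...\nspam\tFree entry\nham\tSee you\n'

# Same approach as above: the trailing '\n' yields a final empty string
parsed = sample.replace('\t', '\n').split('\n')
labels = parsed[0::2]
texts = parsed[1::2]
print(len(labels), len(texts))  # labels carries one extra, empty element


def parse_lines(raw):
    """Split each non-empty line once, on its first tab, into (label, text)."""
    rows = []
    for line in raw.splitlines():
        if not line:
            continue
        label, text = line.split('\t', 1)
        rows.append((label, text))
    return rows


rows = parse_lines(sample)
print(rows)
```

Splitting each line only once on the first tab also keeps a message intact if its text happens to contain a tab character.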


In [9]:
# Create dataframe to store all structured text data
import pandas as pd

fullCorpus = pd.DataFrame({
    'label': labelList[:-1],
    'text': textList
})

fullCorpus.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [38]:
# Easy way to load text data
dataset = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
dataset.head()

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


### Explore the dataset

In [39]:
# Set column names for dataset
dataset.columns = ['label', 'body_text']
dataset.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [40]:
# Shape of dataset
print('Input data has {} rows and {} columns'.format(len(dataset), len(dataset.columns)))

Input data has 5572 rows and 2 columns
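
Note that the manual parse earlier found 5574 messages, while `read_csv` reports 5572 rows. One plausible explanation (an assumption, not verified against this file) is quote handling: CSV parsers treat `"` as a quoting character by default, so a message that begins with a double quote can absorb the following lines until the next closing quote. The sketch below reproduces that effect with the standard-library `csv` module on made-up data. If this is the cause, passing `quoting=csv.QUOTE_NONE` to `pd.read_csv` should bring the missing rows back.

```python
import csv
import io

# Made-up data: the second message opens a double quote that is only
# closed at the end of the third line
sample = 'ham\tOk lar\nham\t"Hi there\nspam\tYou won"\nham\tBye\n'

# Default quoting: the quoted field swallows the newline and the next row
default_rows = list(csv.reader(io.StringIO(sample, newline=''), delimiter='\t'))

# QUOTE_NONE: quote characters are treated as ordinary text
raw_rows = list(csv.reader(io.StringIO(sample, newline=''),
                           delimiter='\t', quoting=csv.QUOTE_NONE))

print(len(default_rows), len(raw_rows))
```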


In [41]:
# Spam/ham numbers
print('Out of {} rows, {} are spam, {} are ham'.format(len(dataset),
                                                       len(dataset[dataset['label']=='spam']),
                                                       len(dataset[dataset['label']=='ham'])))

Out of 5572 rows, 747 are spam, 4825 are ham


In [42]:
# Check missing data
print('Number of null in label: {}'.format(dataset['label'].isnull().sum()))
print('Number of null in text: {}'.format(dataset['body_text'].isnull().sum()))

Number of null in label: 0
Number of null in text: 0


## 3.3 Regular Expressions
* `regular`: search for the explicit string "regular" within some other string, **"regular expression" &mdash; "regular"**
* `[l-r]`: search for all single characters between 'l' and 'r' in any text, return single characters at a time, **"regular" &mdash; 'r', 'l', 'r'**
* `[a-u]+`: search for any character between 'a' and 'u' with added flexibility of returning strings of multiple characters together that are between 'a' and 'u', **"regular expression" &mdash; "regular", 'e', "pression"**
* `[0-9]+`: return all numbers with flexibility of returning sequences of more than one number, **"year2019" &mdash; "2019"**
* `[a-u0-9]+`: search for sequences of characters between 'a' and 'u' or numbers between 0 and 9, **"regular expression year2019" &mdash; "regular", 'e', "pression", "ear2019"**
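
The patterns in the list above can be checked directly with `re.findall`. This is a small sketch, and the `examples` and `results` names are only for illustration.

```python
import re

# (pattern, text) pairs taken from the list above
examples = [
    (r'regular', 'regular expression'),
    (r'[l-r]', 'regular'),
    (r'[a-u]+', 'regular expression'),
    (r'[0-9]+', 'year2019'),
    (r'[a-u0-9]+', 'regular expression year2019'),
]

results = {}
for pattern, text in examples:
    results[pattern] = re.findall(pattern, text)
    print(pattern, '->', results[pattern])
```

Characters outside the class, such as `x` and `y` here, act as separators: they end one match and the next match starts after them.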

### Use regular expressions
In Python, the `re` package is the most commonly used regex resource. More details can be found [here](https://docs.python.org/3/library/re.html).

In [43]:
import re

### Split a sentence into a list of words

In [44]:
# Split sentence on single whitespace characters (raw string avoids invalid-escape warnings)
re.split(r'\s', textList[0])

['Go',
 'until',
 'jurong',
 'point,',
 'crazy..',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet...',
 'Cine',
 'there',
 'got',
 'amore',
 'wat...']

In [45]:
# Split sentence on runs of non-word characters
re.split(r'\W+', textList[0])

['Go',
 'until',
 'jurong',
 'point',
 'crazy',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'Cine',
 'there',
 'got',
 'amore',
 'wat',
 '']

In [46]:
# Find all runs of one or more word characters
re.findall(r'\w+', textList[0])

['Go',
 'until',
 'jurong',
 'point',
 'crazy',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'Cine',
 'there',
 'got',
 'amore',
 'wat']

#### Notes
* Useful methods for tokenizing: `findall()`, `split()`
* Useful regexes for tokenizing: `\w` matches word characters and `\W` matches non-word characters; `\s` matches whitespace and `\S` matches non-whitespace
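
The splitting approaches differ in their edge cases. The sketch below uses a made-up string with a double space and trailing punctuation to show where empty strings appear.

```python
import re

# Illustrative text: note the double space after 'Ok' and the trailing dots
text = 'Ok  lar... Joking wif u oni...'

single_ws = re.split(r'\s', text)     # each whitespace character is a separator
multi_ws = re.split(r'\s+', text)     # a run of whitespace is one separator
non_word = re.split(r'\W+', text)     # trailing punctuation leaves a final ''
word_runs = re.findall(r'\w+', text)  # collects the tokens, so no empty strings

print(single_ws)
print(multi_ws)
print(non_word)
print(word_runs)
```

This is one reason `findall()` with `\w+` is convenient for tokenizing: it needs no post-filtering of empty strings.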

### Other useful regex methods
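* `re.sub(pattern, repl, string)`: replace every match of `pattern` in `string` with `repl`
* `re.search()`: find the first match anywhere in the string
* `re.match()`: match only at the start of the string
* `re.fullmatch()`: match only if the whole string fits the pattern
* `re.finditer()`: iterate over all matches as match objects
* `re.escape()`: escape special characters so a literal string can be used as a pattern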
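
Here is a brief sketch of the methods listed above, applied to a made-up string. The variable names are illustrative.

```python
import re

text = 'Text FA to 87121 by 21st May 2005'  # illustrative input

# sub: replace every run of digits with '#'
masked = re.sub(r'[0-9]+', '#', text)

# search: first match anywhere; match: only at the start of the string
first_number = re.search(r'[0-9]+', text).group()
starts_with_number = re.match(r'[0-9]+', text) is not None

# fullmatch: the entire string has to fit the pattern
is_all_digits = re.fullmatch(r'[0-9]+', '87121') is not None

# finditer: iterate over match objects instead of plain strings
numbers = [m.group() for m in re.finditer(r'[0-9]+', text)]

# escape: make a literal '...' safe to use as a pattern
literal_dots = re.escape('...')
found_dots = re.findall(literal_dots, 'wat... ok.')

print(masked, first_number, starts_with_number, is_all_digits, numbers, found_dots)
```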

## 3.4 Clean text
### Pre-processing text data
Cleaning up the text data is necessary to highlight the attributes that we want the machine learning system to pick up. Cleaning (or pre-processing) the text data typically consists of a number of steps:
1. Remove punctuation
2. Tokenization
3. Remove stopwords
4. Lemmatize/Stem

The first three steps are covered below because they appear in almost every text-cleaning pipeline; lemmatizing and stemming are helpful but not critical.

In [47]:
pd.set_option('display.max_colwidth', 100)

dataset.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


### Remove punctuation

In [48]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [49]:
def remove_punct(text):
    text_nopunct = ''.join([char for char in text if char not in string.punctuation])
    return text_nopunct

dataset['body_text_clean'] = dataset['body_text'].apply(lambda x: remove_punct(x))
dataset.head()

Unnamed: 0,label,body_text,body_text_clean
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
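
An alternative to the character-by-character list comprehension is `str.translate`, which deletes all punctuation in a single pass and is typically faster on long strings. This is a sketch only, and the `remove_punct_translate` name is hypothetical.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punct_translate(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)


message = "Nah I don't think he goes to usf, he lives around here though"
cleaned = remove_punct_translate(message)
print(cleaned)
```

Either way, contractions lose their apostrophe (`don't` becomes `dont`), which matters later when tokens are compared against a stop-word list.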


### Tokenization

In [50]:
def tokenize(text):
    tokens = re.findall(r'\w+', text)
    return tokens

dataset['body_text_tokenized'] = dataset['body_text_clean'].apply(lambda x: tokenize(x.lower()))

dataset.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"


### Remove stopwords

In [51]:
stopword = nltk.corpus.stopwords.words('english')

def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text

dataset['body_text_nostop'] = dataset['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

dataset.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
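
In the output above, `dont` survives stop-word removal. Punctuation was stripped before the lookup, so the token can no longer match a list entry that contains an apostrophe (such as `you're` in the list printed earlier). Whether contractions should be dropped is a modelling choice, but one option is to normalize the stop words the same way as the text. The sketch below uses a tiny hypothetical stop-word list rather than the NLTK one, and stores it in a set for fast membership checks.

```python
import string

# Hypothetical, tiny stop-word list; the real list comes from NLTK
raw_stopwords = ['i', 'he', 'to', "don't", 'here']

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

# Sets give constant-time lookups compared with scanning a list
raw_set = set(raw_stopwords)
normalized_set = {word.translate(PUNCT_TABLE) for word in raw_stopwords}

# Tokens after punctuation removal and lowercasing
tokens = ['nah', 'i', 'dont', 'think', 'he', 'goes', 'to', 'usf']

kept_raw = [t for t in tokens if t not in raw_set]
kept_normalized = [t for t in tokens if t not in normalized_set]

print(kept_raw)
print(kept_normalized)
```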


## 3.5 Stemming & Lemmatizing &mdash; Supplemental Data Cleaning
### Test Porter Stemmer

In [52]:
ps = nltk.PorterStemmer()

print(ps.stem('use'))
print(ps.stem('uses'))
print(ps.stem('used'))
print(ps.stem('using'))
print(ps.stem('useful'))
print(ps.stem('useless'))
print(ps.stem('user'))

use
use
use
use
use
useless
user


### Combine the clean-up steps into one function

In [53]:
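# Reload original data
# Note: header=None is omitted here, so pandas consumes the first message as the
# header row; that is why the output below starts at 'Ok lar...'.
# Pass header=None (as in the earlier load) to keep every row.
data = pd.read_csv('SMSSpamCollection', sep='\t')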
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,Ok lar... Joking wif u oni...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,U dun say so early hor... U c already then say...
3,ham,"Nah I don't think he goes to usf, he lives around here though"
4,spam,FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for ...


In [54]:
def clean_text(text):
    text = ''.join([char for char in text if char not in string.punctuation])
    tokens = re.findall(r'\w+', text)
    text = [word for word in tokens if word not in stopword]
    return text

data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))

data.head()

Unnamed: 0,label,body_text,body_text_nostop
0,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
3,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"
4,spam,FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for ...,"[freemsg, hey, darling, 3, weeks, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send..."


### Stem text

In [55]:
def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text]
    return text

data['body_text_stemmed'] = data['body_text_nostop'].apply(lambda x: stemming(x))

data.head()

Unnamed: 0,label,body_text,body_text_nostop,body_text_stemmed
0,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]"
3,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
4,spam,FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for ...,"[freemsg, hey, darling, 3, weeks, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send...","[freemsg, hey, darl, 3, week, word, back, id, like, fun, still, tb, ok, xxx, std, chg, send, 150..."


The stemmer won't do a great job with slang or abbreviations (for example, `wkly` becomes `wkli` and `tkts` becomes `tkt` above).

### Test out WordNet lemmatizer (more about WordNet is [here](https://wordnet.princeton.edu/))

In [56]:
wn = nltk.WordNetLemmatizer()

print(wn.lemmatize('used'))
print(wn.lemmatize('useful'))

used
useful

By default, `lemmatize()` treats every word as a noun (`pos='n'`), which is why `used` comes back unchanged; passing `pos='v'` lets it be reduced to `use`.


In [57]:
print(ps.stem('feet'))
print(ps.stem('foot'))
print(wn.lemmatize('feet'))
print(wn.lemmatize('foot'))

feet
foot
foot
foot


### Lemmatize text

In [58]:
def lemmatizing(tokenized_text):
    text = [wn.lemmatize(word) for word in tokenized_text]
    return text

data['body_text_lemmatized'] = data['body_text_nostop'].apply(lambda x: lemmatizing(x))

data.head()

Unnamed: 0,label,body_text,body_text_nostop,body_text_stemmed,body_text_lemmatized
0,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]","[u, dun, say, early, hor, u, c, already, say]"
3,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]","[nah, dont, think, go, usf, life, around, though]"
4,spam,FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for ...,"[freemsg, hey, darling, 3, weeks, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send...","[freemsg, hey, darl, 3, week, word, back, id, like, fun, still, tb, ok, xxx, std, chg, send, 150...","[freemsg, hey, darling, 3, week, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send,..."


Like the stemmer, the lemmatizer won't do particularly well with slang or abbreviations, so it's not ideal for this data set. It might be much more effective if it were used on a collection of book reports or journal articles. Both stemming and lemmatizing help reduce the corpus of words that the model is exposed to, and they explicitly group words with similar meaning. The lemmatizer is typically more accurate than the stemmer, but the trade-off is that it takes a little longer to run.
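
The difference in accuracy comes from how the two approaches work: a stemmer applies string rules without knowing the vocabulary, while a lemmatizer looks words up in a dictionary such as WordNet. The toy sketch below illustrates that contrast only. Neither function reproduces NLTK's behaviour, and `toy_stem`, `toy_lemmatize`, and `TOY_LEMMAS` are hypothetical names.

```python
# Toy illustration only: not the Porter algorithm and not WordNet

def toy_stem(word, suffixes=('ing', 'ly', 'es', 's')):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical lookup table standing in for a real vocabulary
TOY_LEMMAS = {'feet': 'foot', 'goes': 'go', 'lives': 'life'}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)


words = ['feet', 'goes', 'lives', 'joking', 'wkly']
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

for word, stem, lemma in zip(words, stems, lemmas):
    print(word, stem, lemma)
```

The rule-based version can produce non-words, while the lookup version returns real words but leaves anything outside its vocabulary (such as `wkly`) untouched.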