# Text Tokenization and Cleaning with NLTK
@ Sani Kamal, 2019

The Natural Language Toolkit, or NLTK for short, is a Python library for working with and modeling text. It provides tools for loading and cleaning text, so that the data is ready for use with machine learning and deep learning algorithms.

## Install NLTK
You can install NLTK using your favorite package manager, such as pip:

In [16]:
# pip install nltk

In [17]:
# import nltk
# nltk.download()
# Or from the command line
# python -m nltk.downloader all

## Load Data

In [18]:
# load data
filename = 'data/fireless_cook_book_clean.txt'
with open(filename, 'rt') as file:
    text = file.read()

## Split into Sentences
NLTK provides the `sent_tokenize()` function to split text into sentences.

In [19]:
from nltk import sent_tokenize

# split into sentences
sentences = sent_tokenize(text)
print(sentences[10])

Roast meats, however, may first be
cooked in the oven and completed in the hay-box or cooker, or they may
be cooked in the hay-box till nearly done and then roasted for a short
time to obtain the crispness which can be given only by cooking with
great heat.
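
The sentence printed above still contains the line breaks of the source file, because `sent_tokenize()` returns each sentence as it appears in the text. If single-line sentences are preferred, the whitespace can be collapsed afterwards. The following is a minimal sketch using only the standard library; the helper name `normalize_whitespace` and the sample string are illustrative assumptions, not part of NLTK.

```python
def normalize_whitespace(sentence):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return ' '.join(sentence.split())


# shortened sample based on the sentence printed above
raw = "Roast meats, however, may first be\ncooked in the oven and completed in the hay-box or cooker"
print(normalize_whitespace(raw))
```

Applied to the full list, this could be written as `[normalize_whitespace(s) for s in sentences]`.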


## Split into Words
NLTK provides a function called `word_tokenize()` for splitting strings into tokens (nominally words). It splits tokens based on white space and punctuation.

In [20]:
from nltk.tokenize import word_tokenize

# split into words
tokens = word_tokenize(text)
print(tokens[:100])

['THE', 'FIRELESS', 'COOKER', 'Does', 'the', 'idea', 'appeal', 'to', 'you', 'of', 'putting', 'your', 'dinner', 'on', 'to', 'cook', 'and', 'then', 'going', 'visiting', ',', 'or', 'to', 'the', 'theatre', ',', 'or', 'sitting', 'down', 'to', 'read', ',', 'write', ',', 'or', 'sew', ',', 'with', 'no', 'further', 'thought', 'for', 'your', 'food', 'until', 'it', 'is', 'time', 'to', 'serve', 'it', '?', 'It', 'sounds', 'like', 'a', 'fairy-tale', 'to', 'say', 'that', 'you', 'can', 'bring', 'food', 'to', 'the', 'boiling', 'point', ',', 'put', 'it', 'into', 'a', 'box', 'of', 'hay', ',', 'and', 'leave', 'it', 'for', 'a', 'few', 'hours', ',', 'returning', 'to', 'find', 'it', 'cooked', ',', 'and', 'often', 'better', 'cooked', 'than', 'in', 'any', 'other', 'way']
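
For comparison, splitting on whitespace alone leaves punctuation attached to the neighbouring words, which is one reason a tokenizer is usually preferable. This small sketch uses only Python's built-in `str.split()` on a phrase from the text above; it does not call NLTK.

```python
# a phrase from the opening of the book
phrase = "going visiting, or to the theatre, or sitting down to read"

# naive approach: split on whitespace only
naive_tokens = phrase.split()
print(naive_tokens)
# 'visiting,' and 'theatre,' keep their trailing commas, whereas the
# word_tokenize() output above lists each ',' as a separate token
```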


## Filter Out Punctuation
Standalone punctuation can be removed by keeping only the tokens for which Python's `str.isalpha()` returns `True`.

In [21]:
# split into words
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])

['THE', 'FIRELESS', 'COOKER', 'Does', 'the', 'idea', 'appeal', 'to', 'you', 'of', 'putting', 'your', 'dinner', 'on', 'to', 'cook', 'and', 'then', 'going', 'visiting', 'or', 'to', 'the', 'theatre', 'or', 'sitting', 'down', 'to', 'read', 'write', 'or', 'sew', 'with', 'no', 'further', 'thought', 'for', 'your', 'food', 'until', 'it', 'is', 'time', 'to', 'serve', 'it', 'It', 'sounds', 'like', 'a', 'to', 'say', 'that', 'you', 'can', 'bring', 'food', 'to', 'the', 'boiling', 'point', 'put', 'it', 'into', 'a', 'box', 'of', 'hay', 'and', 'leave', 'it', 'for', 'a', 'few', 'hours', 'returning', 'to', 'find', 'it', 'cooked', 'and', 'often', 'better', 'cooked', 'than', 'in', 'any', 'other', 'way', 'Yet', 'it', 'is', 'true', 'Norwegian', 'housewives', 'have', 'known', 'this', 'for', 'many']
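
The `isalpha()` filter is strict: it discards any token that contains a non-letter character, which is why `fairy-tale` is missing from the output above. If hyphenated words should be kept, one option is to drop only the tokens made up entirely of punctuation. This is a minimal sketch of that alternative on a hand-written token list; the helper name `is_punctuation` is a hypothetical choice for illustration.

```python
import string

tokens = ['It', 'sounds', 'like', 'a', 'fairy-tale', 'to', 'say', ',', '?']

# strict filter used above: hyphenated words are dropped too
alpha_only = [t for t in tokens if t.isalpha()]


def is_punctuation(token):
    """Return True if every character in the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)


# looser filter: remove only pure-punctuation tokens
no_punct = [t for t in tokens if not is_punctuation(t)]

print(alpha_only)
print(no_punct)
```

The looser filter also keeps tokens that contain digits, so it is not a drop-in replacement when only alphabetic words are wanted.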


## Filter Out Stop Words
Stop words are words that contribute little to the deeper meaning of a phrase. They are typically the most common words, such as `the`, `a`, and `is`. For some applications, such as document classification, it may make sense to remove them. NLTK provides lists of commonly agreed upon stop words for a variety of languages, including English.

In [22]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
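
Filtering against this list is a membership test per token, and converting the list to a `set` first keeps each lookup fast on long texts. The sketch below uses a small hand-picked subset of the words printed above rather than the full NLTK list, so it runs without downloading the corpus; the subset and the sample tokens are illustrative.

```python
# a small subset of the NLTK English stop words shown above
stop_words = {'the', 'to', 'you', 'of', 'your', 'on', 'and', 'does'}

tokens = ['Does', 'the', 'idea', 'appeal', 'to', 'you', 'of',
          'putting', 'your', 'dinner', 'on', 'to', 'cook']

# the stop word list is lowercase, so compare lowercased tokens
content_words = [t for t in tokens if t.lower() not in stop_words]
print(content_words)
```

Because the list contains only lowercase entries, a token such as `Does` or `THE` is matched only if it is lowercased before the comparison.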

## Simple Pipeline of Text Preparation
- Load the raw text.
- Split into tokens.
- Convert to lowercase.
- Remove punctuation from each token.
- Filter out remaining tokens that are not alphabetic.
- Filter out tokens that are stop words.

In [23]:
import string
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# load data
filename = 'data/fireless_cook_book_clean.txt'
with open(filename, 'rt') as file:
    text = file.read()

# split into words
tokens = word_tokenize(text)

# convert to lower case
tokens = [w.lower() for w in tokens]

# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in tokens]

# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]

# filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

['fireless', 'cooker', 'idea', 'appeal', 'putting', 'dinner', 'cook', 'going', 'visiting', 'theatre', 'sitting', 'read', 'write', 'sew', 'thought', 'food', 'time', 'serve', 'sounds', 'like', 'fairytale', 'say', 'bring', 'food', 'boiling', 'point', 'put', 'box', 'hay', 'leave', 'hours', 'returning', 'find', 'cooked', 'often', 'better', 'cooked', 'way', 'yet', 'true', 'norwegian', 'housewives', 'known', 'many', 'years', 'european', 'nations', 'used', 'haybox', 'considerable', 'extent', 'although', 'recently', 'wonders', 'become', 'rather', 'widely', 'known', 'talked', 'america', 'original', 'box', 'filled', 'hay', 'gone', 'process', 'evolution', 'become', 'fireless', 'cooker', 'varied', 'form', 'adaptability', 'expect', 'fireless', 'cooker', 'foods', 'cook', 'advantage', 'almost', 'dishes', 'usually', 'prepared', 'boiling', 'steaming', 'well', 'many', 'baked', 'soups', 'boiled', 'braised', 'meats', 'fish', 'sauces', 'fruits', 'vegetables', 'puddings', 'eggs', 'fact', 'almost']
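
For reuse, the cleaning steps that follow tokenization can be wrapped in a single function. This is a sketch rather than part of NLTK: the function name `clean_tokens` and its parameters are assumptions, and it takes an already tokenized list plus a stop word collection so that it can be tried without the corpus downloads.

```python
import re
import string

# matches any single ASCII punctuation character
RE_PUNC = re.compile('[%s]' % re.escape(string.punctuation))


def clean_tokens(tokens, stop_words):
    """Lowercase, strip punctuation, keep alphabetic non-stop-word tokens."""
    stop_words = set(stop_words)
    cleaned = []
    for token in tokens:
        # lowercase first, then remove punctuation characters
        word = RE_PUNC.sub('', token.lower())
        # drop empty or non-alphabetic leftovers and stop words
        if word.isalpha() and word not in stop_words:
            cleaned.append(word)
    return cleaned


sample = ['It', 'sounds', 'like', 'a', 'fairy-tale', ',', 'to', 'say']
print(clean_tokens(sample, ['it', 'a', 'to']))
```

With NLTK and its data installed, calling `clean_tokens(word_tokenize(text), stopwords.words('english'))` should give the same list as the cell above, since it applies the same steps in the same order.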


## Stem Words
Stemming refers to the process of reducing each word to its root or base form. Some applications, like document classification, may benefit from stemming, both to reduce the vocabulary and to focus on the sense or sentiment of a document rather than its deeper meaning. There are many stemming algorithms, although a popular and long-standing method is the Porter stemming algorithm, available in NLTK as `PorterStemmer`.

In [24]:
from nltk.stem.porter import PorterStemmer

# split into words
tokens = word_tokenize(text)
# stemming of words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

['the', 'fireless', 'cooker', 'doe', 'the', 'idea', 'appeal', 'to', 'you', 'of', 'put', 'your', 'dinner', 'on', 'to', 'cook', 'and', 'then', 'go', 'visit', ',', 'or', 'to', 'the', 'theatr', ',', 'or', 'sit', 'down', 'to', 'read', ',', 'write', ',', 'or', 'sew', ',', 'with', 'no', 'further', 'thought', 'for', 'your', 'food', 'until', 'it', 'is', 'time', 'to', 'serv', 'it', '?', 'It', 'sound', 'like', 'a', 'fairy-tal', 'to', 'say', 'that', 'you', 'can', 'bring', 'food', 'to', 'the', 'boil', 'point', ',', 'put', 'it', 'into', 'a', 'box', 'of', 'hay', ',', 'and', 'leav', 'it', 'for', 'a', 'few', 'hour', ',', 'return', 'to', 'find', 'it', 'cook', ',', 'and', 'often', 'better', 'cook', 'than', 'in', 'ani', 'other', 'way']
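
The vocabulary reduction mentioned above can be seen on a handful of inflected forms. This sketch assumes NLTK is installed (the Porter stemmer itself needs no additional data downloads) and uses a hand-written word list rather than the book text.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# several surface forms of the same two verbs
words = ['cook', 'cooks', 'cooked', 'cooking', 'serve', 'served', 'serving']
stems = [porter.stem(w) for w in words]

# count distinct entries before and after stemming
print(len(set(words)), 'distinct words')
print(len(set(stems)), 'distinct stems')
print(sorted(set(stems)))
```

As in the output above, stems such as `serv` are not necessarily dictionary words; they only need to be consistent across the different forms of a word.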
