# Clean Text With Python
@ Sani Kamal, 2019

## The Fireless Cook Book, by Margaret Johnes Mitchell

Let’s start by selecting a dataset. In this kernel, we will use the text of the book
`The Fireless Cook Book` by Margaret Johnes Mitchell. The full text is available for free from Project Gutenberg. You can download the plain text version here:

- [The Fireless Cook Book, by Margaret Johnes Mitchell](http://www.gutenberg.org/files/60598/60598-0.txt)

Download the file and save it with the file name
`fireless_cook_book.txt`. The file contains header and footer information that we are not interested in, specifically copyright and license information. Open the file, delete the header and footer, and save the result as `fireless_cook_book_clean.txt`. The loading code below reads the cleaned file from a `data/` subdirectory of the current working directory, so place it there or adjust the path.
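
The header and footer can also be trimmed in code rather than by hand. The sketch below assumes the downloaded file marks the book body with lines that begin with `*** START OF` and `*** END OF`; the exact wording differs between Project Gutenberg files, so check your copy first. The function name `strip_gutenberg_boilerplate` and the sample string are hypothetical.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the '*** START OF' and '*** END OF' lines.

    Assumption: both markers begin a line. If a marker is missing, the text
    is left untouched at that end.
    """
    lines = raw_text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith('*** START OF'):
            start = i + 1  # body begins on the line after the marker
        elif line.startswith('*** END OF'):
            end = i        # body stops just before the marker
            break
    return '\n'.join(lines[start:end]).strip()


# made-up sample that mimics the assumed layout
sample = (
    "License header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "THE FIRELESS COOKER\n"
    "Does the idea appeal to you\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Front matter inside the markers, such as a title page or transcriber's notes, would still need to be removed by hand.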

## Load Data

In [24]:
# load text
filename = 'data/fireless_cook_book_clean.txt'
# the with-block closes the file even if reading fails
with open(filename, 'rt') as file:
    text = file.read()

## Split by Whitespace

In [25]:
# split into words by white space
words = text.split()
print(words[:100])

['THE', 'FIRELESS', 'COOKER', 'Does', 'the', 'idea', 'appeal', 'to', 'you', 'of', 'putting', 'your', 'dinner', 'on', 'to', 'cook', 'and', 'then', 'going', 'visiting,', 'or', 'to', 'the', 'theatre,', 'or', 'sitting', 'down', 'to', 'read,', 'write,', 'or', 'sew,', 'with', 'no', 'further', 'thought', 'for', 'your', 'food', 'until', 'it', 'is', 'time', 'to', 'serve', 'it?', 'It', 'sounds', 'like', 'a', 'fairy-tale', 'to', 'say', 'that', 'you', 'can', 'bring', 'food', 'to', 'the', 'boiling', 'point,', 'put', 'it', 'into', 'a', 'box', 'of', 'hay,', 'and', 'leave', 'it', 'for', 'a', 'few', 'hours,', 'returning', 'to', 'find', 'it', 'cooked,', 'and', 'often', 'better', 'cooked', 'than', 'in', 'any', 'other', 'way!', 'Yet', 'it', 'is', 'true.', 'Norwegian', 'housewives', 'have', 'known', 'this', 'for']


## Select Words
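Use the regex module (`re`) to split the document into words. The pattern `\W+` matches runs of non-word characters, so splitting on it keeps only strings of alphanumeric characters (a-z, A-Z, 0-9 and '_'). In Python 3, `\w` also matches non-ASCII letters and digits.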

In [26]:
import re

# split based on words only
words = re.split(r'\W+',text)
print(words[:100])

['THE', 'FIRELESS', 'COOKER', 'Does', 'the', 'idea', 'appeal', 'to', 'you', 'of', 'putting', 'your', 'dinner', 'on', 'to', 'cook', 'and', 'then', 'going', 'visiting', 'or', 'to', 'the', 'theatre', 'or', 'sitting', 'down', 'to', 'read', 'write', 'or', 'sew', 'with', 'no', 'further', 'thought', 'for', 'your', 'food', 'until', 'it', 'is', 'time', 'to', 'serve', 'it', 'It', 'sounds', 'like', 'a', 'fairy', 'tale', 'to', 'say', 'that', 'you', 'can', 'bring', 'food', 'to', 'the', 'boiling', 'point', 'put', 'it', 'into', 'a', 'box', 'of', 'hay', 'and', 'leave', 'it', 'for', 'a', 'few', 'hours', 'returning', 'to', 'find', 'it', 'cooked', 'and', 'often', 'better', 'cooked', 'than', 'in', 'any', 'other', 'way', 'Yet', 'it', 'is', 'true', 'Norwegian', 'housewives', 'have', 'known', 'this']
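
One detail worth knowing: when the text starts or ends with a non-word character, `re.split` leaves an empty string at that end of the list. A minimal sketch of the difference, using a made-up sample sentence and `re.findall` as an alternative that selects the word runs directly:

```python
import re

sample = "It sounds like a fairy-tale, doesn't it?"

# re.split cuts on runs of non-word characters; the trailing '?' leaves
# an empty string as the last element
split_words = re.split(r'\W+', sample)

# re.findall returns only the runs of word characters, so no empty strings
found_words = re.findall(r'\w+', sample)

print(split_words)
print(found_words)
```

Both approaches break `fairy-tale` and `doesn't` into separate pieces, because the hyphen and the apostrophe are non-word characters.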


## Split by Whitespace and Remove Punctuation

In [27]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [28]:
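# character class matching any single punctuation character
# (no spaces inside the pattern, otherwise only ' x ' sequences would match)
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
# note: words still holds the regex-split tokens from the previous cell
stripped = [re_punc.sub('', w) for w in words]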
print(stripped[:100])

['THE', 'FIRELESS', 'COOKER', 'Does', 'the', 'idea', 'appeal', 'to', 'you', 'of', 'putting', 'your', 'dinner', 'on', 'to', 'cook', 'and', 'then', 'going', 'visiting', 'or', 'to', 'the', 'theatre', 'or', 'sitting', 'down', 'to', 'read', 'write', 'or', 'sew', 'with', 'no', 'further', 'thought', 'for', 'your', 'food', 'until', 'it', 'is', 'time', 'to', 'serve', 'it', 'It', 'sounds', 'like', 'a', 'fairy', 'tale', 'to', 'say', 'that', 'you', 'can', 'bring', 'food', 'to', 'the', 'boiling', 'point', 'put', 'it', 'into', 'a', 'box', 'of', 'hay', 'and', 'leave', 'it', 'for', 'a', 'few', 'hours', 'returning', 'to', 'find', 'it', 'cooked', 'and', 'often', 'better', 'cooked', 'than', 'in', 'any', 'other', 'way', 'Yet', 'it', 'is', 'true', 'Norwegian', 'housewives', 'have', 'known', 'this']
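
For comparison, here is a small sketch of the approach the heading describes, splitting on whitespace first and then deleting punctuation from each token. It swaps the regex for `str.translate`, which is an alternative technique and not the method used in the cell above; the sample string is made up.

```python
import string

sample = "visiting, or to the theatre, a fairy-tale; is it true?"

# translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

# split on whitespace first, then strip punctuation from each token
stripped_sample = [w.translate(table) for w in sample.split()]
print(stripped_sample)
```

With this order of operations `fairy-tale` becomes the single token `fairytale`, whereas the regex split above yields `fairy` and `tale`.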


In [29]:
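# character class matching anything that is not a printable ASCII character
re_print = re.compile('[^%s]' % re.escape(string.printable))
# remove non-printable characters from each word
result = [re_print.sub('', w) for w in words]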
print(result[200:301])

['sauces', 'fruits', 'vegetables', 'puddings', 'eggs', 'in', 'fact', 'almost', 'everything', 'that', 'does', 'not', 'need', 'to', 'be', 'crisp', 'can', 'be', 'cooked', 'in', 'a', 'simple', 'hay', 'box', 'If', 'the', 'composition', 'of', 'foods', 'and', 'the', 'general', 'principles', 'of', 'cookery', 'are', 'well', 'understood', 'but', 'little', 'special', 'instruction', 'will', 'be', 'needed', 'to', 'enable', 'one', 'to', 'prepare', 'such', 'dishes', 'with', 'success', 'though', 'even', 'a', 'novice', 'may', 'use', 'a', 'fireless', 'cooker', 'if', 'the', 'general', 'directions', 'and', 'explanations', 'as', 'well', 'as', 'the', 'individual', 'recipes', 'are', 'carefully', 'read', 'and', 'followed', 'While', 'such', 'dishes', 'as', 'toast', 'pancakes', 'roast', 'or', 'broiled', 'meats', 'baked', 'bread', 'and', 'biscuits', 'are', 'impossible', 'to', 'cook', 'in', 'the', 'simpler']
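
The slice above happens to contain only ASCII words, so the filter's effect is not visible in it. A minimal sketch on a few made-up tokens shows what the pattern does to characters outside `string.printable`:

```python
import re
import string

# character class matching anything that is NOT in string.printable
re_print = re.compile('[^%s]' % re.escape(string.printable))

samples = ['café', 'naïve', 'hay\u2014box', 'plain']
cleaned = [re_print.sub('', w) for w in samples]
print(cleaned)
```

Accented letters are dropped rather than converted to their unaccented form, so `café` becomes `caf`. Whether that is acceptable depends on the text being cleaned.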


## Normalizing Case

In [30]:
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

['the', 'fireless', 'cooker', 'does', 'the', 'idea', 'appeal', 'to', 'you', 'of', 'putting', 'your', 'dinner', 'on', 'to', 'cook', 'and', 'then', 'going', 'visiting,', 'or', 'to', 'the', 'theatre,', 'or', 'sitting', 'down', 'to', 'read,', 'write,', 'or', 'sew,', 'with', 'no', 'further', 'thought', 'for', 'your', 'food', 'until', 'it', 'is', 'time', 'to', 'serve', 'it?', 'it', 'sounds', 'like', 'a', 'fairy-tale', 'to', 'say', 'that', 'you', 'can', 'bring', 'food', 'to', 'the', 'boiling', 'point,', 'put', 'it', 'into', 'a', 'box', 'of', 'hay,', 'and', 'leave', 'it', 'for', 'a', 'few', 'hours,', 'returning', 'to', 'find', 'it', 'cooked,', 'and', 'often', 'better', 'cooked', 'than', 'in', 'any', 'other', 'way!', 'yet', 'it', 'is', 'true.', 'norwegian', 'housewives', 'have', 'known', 'this', 'for']
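
The steps covered so far can be combined into one helper. This is a sketch under assumptions: the function name `clean_tokens` is hypothetical, the sample string is made up, and punctuation is removed with `str.translate` instead of the regex used earlier.

```python
import string


def clean_tokens(raw_text):
    """Split on whitespace, strip punctuation, lower-case, drop empties."""
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table).lower() for w in raw_text.split()]
    # tokens made only of punctuation become '' and are dropped here
    return [t for t in tokens if t]


sample = "Yet it is TRUE -- Norwegian housewives have known this!"
tokens = clean_tokens(sample)
print(tokens)
```

Lower-casing merges variants such as `It` and `it` into one token, at the cost of losing the distinction between proper nouns and ordinary words.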
