## Data Preparation for NLP tasks

### How to clean text manually and with NLTK

#### 1. Metamorphosis dataset

In [1]:
!ls

 conv_net.png		    Models			  rec_net.png
 Data			    multi_perc_graph.png	  Weights
'Deep Learning NLP.ipynb'  'NLP Data Preparation.ipynb'


In [2]:
# Download the dataset for cleaning
!wget -O Data/pg5200.txt http://www.gutenberg.org/cache/epub/5200/pg5200.txt

--2020-03-27 00:35:41--  http://www.gutenberg.org/cache/epub/5200/pg5200.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 141420 (138K) [text/plain]
Saving to: ‘Data/pg5200.txt’


2020-03-27 00:35:45 (46.3 KB/s) - ‘Data/pg5200.txt’ saved [141420/141420]



In [3]:
!mv Data/pg5200.txt Data/metamorphosis.txt

In [4]:
!ls Data

metamorphosis.txt


In [5]:
!cat Data/metamorphosis.txt

﻿The Project Gutenberg EBook of Metamorphosis, by Franz Kafka
Translated by David Wyllie.

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **
**     Please follow the copyright guidelines in this file.     **


Title: Metamorphosis

Author: Franz Kafka

Translator: David Wyllie

Release Date: August 16, 2005 [EBook #5200]
First posted: May 13, 2002
Last updated: May 20, 2012

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***




Copyright (C) 2002 David Wyllie.





  Metamorphosis
  Franz Kafka

Translated by David Wyllie



I


One morning, when Gregor Samsa woke from troubled dreams, he found
himself

In [6]:
# Clean the above text manually, removing the header and footer information, then read the cleaned data
!cat Data/metamorphosis_clean.txt

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.  He lay on
his armour-like back, and if he lifted his head a little he could
see his brown belly, slightly domed and divided by arches into stiff
sections.  The bedding was hardly able to cover it and seemed ready
to slide off any moment.  His many legs, pitifully thin compared
with the size of the rest of him, waved about helplessly as he
looked.

"What's happened to me?" he thought.  It wasn't a dream.  His room,
a proper human room although a little too small, lay peacefully
between its four familiar walls.  A collection of textile samples
lay spread out on the table - Samsa was a travelling salesman - and
above it there hung a picture that he had recently cut out of an
illustrated magazine and housed in a nice, gilded frame.  It showed
a lady fitted out with a fur hat and fur boa who sat upright,
raising a heavy fur muff that cove
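
The manual clean-up above could also be scripted. This is only a rough sketch: the start marker is taken from the header shown earlier, while the `*** END OF` marker and the helper name `strip_gutenberg_boilerplate` are assumptions, not something confirmed by this file.

```python
def strip_gutenberg_boilerplate(raw, start_marker="*** START OF", end_marker="*** END OF"):
    """Keep only the lines between the start and end marker lines."""
    lines = raw.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            start = i + 1          # body begins after the start marker
        elif line.startswith(end_marker):
            end = i                # body stops before the end marker (assumed marker text)
            break
    return "\n".join(lines[start:end]).strip()


# Tiny made-up sample shaped like the file above
sample = (
    "Title: Metamorphosis\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n"
    "\n"
    "One morning, when Gregor Samsa woke\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK METAMORPHOSIS ***\n"
    "License text"
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

On the real file, the copyright line and the title block after the start marker would still need to be removed by hand or with extra rules.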

#### 2. Text cleaning is task specific

#### Looking at the dataset:

    1. It's plain text, no markup.
    2. It uses UK English.
    3. The lines are artificially wrapped with new lines at about 70 characters.
    4. There aren't any obvious typos.
    5. There's heavy use of punctuation, e.g. commas, apostrophes, etc.
    6. There are hyphenated descriptions like armour-like.
    7. The dash (-) is used a lot to continue sentences (maybe consider replacing it with commas?).
    8. There are names.
    9. There do not appear to be numbers that require handling.
    10. There are section markers (II, III)

#### Some possible objectives

    1. If we wanted to build some sort of Kafkaesque language model, we might want to keep all of the case, quotes, and other punctuation in place.
    
    2. If we were interested in classifying documents as Kafka and Not Kafka, maybe we would want to strip case, punctuation, and even trim words back to their stem.
    

**Note** Use your task as the lens by which to choose how to ready your data.

#### 3. Manual tokenization

#### 3.1. Load data

In [8]:
# Load data. The file is small, so it loads quickly and fits into memory; otherwise, write code to memory-map the file.
filename = 'Data/metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
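
For files that do not fit into memory, one alternative to memory mapping is to stream the file line by line. A minimal sketch, where `iter_words` is a hypothetical helper and the temporary file only stands in for a large corpus:

```python
import os
import tempfile
from collections import Counter


def iter_words(path, encoding="utf-8"):
    """Yield whitespace-separated words one line at a time."""
    with open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield from line.split()


# Write a small demo file (a stand-in for a file too large to read at once)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("One morning, when Gregor Samsa woke\nfrom troubled dreams, he found\n")
    demo_path = tmp.name

word_counts = Counter(iter_words(demo_path))
os.remove(demo_path)
print(sum(word_counts.values()))
```

Only one line is held in memory at a time, so the same loop works regardless of file size.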

#### 3.2. Split into words by whitespace

In [20]:
# Split into words by whitespace: Using whitespace to split words
words = text.split()
print(words[:100])

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']


#### 3.3.  Select words: Using regex to split the document.

In [10]:
# Select words: Using regex to split the document into words by selecting strings of alphanumeric characters.
import re
# split based on words only
words = re.split(r'\W+', text)
print(words[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']


We notice that in the first method, 'armour-like' remains one word (we don't want this) and "What's" stays intact (we may want this). In the second method, armour is split from like (which we want), but What is split from s, which changes its meaning. So let's try a different combination.

#### 3.4. Split by whitespace and remove punctuation.

In [13]:
# Split by whitespace and remove punctuation: Use the whitespace split method and then remove punctuation to keep contractions together.
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [15]:
# We can use re to select the punctuation characters and use sub() to replace them with nothing.
re_punctuation = re.compile('[%s]' % re.escape(string.punctuation))
# Split by whitespace first, so the result does not depend on the regex split from 3.3
words = text.split()
# Remove punctuation from each word by substituting with ''
stripped = [re_punctuation.sub('', w) for w in words]

In [16]:
print(stripped[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']
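
The output above fuses 'armour-like' into 'armourlike' and drops the apostrophe from "What's". If the goal is to split on hyphens but keep contractions whole, a single regex can select the words directly. This is a sketch with an assumed pattern (`WORD_RE` is not from the original notebook), and it only handles ASCII letters:

```python
import re

# One or more letters, optionally followed by an apostrophe and more letters
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

sample = 'He lay on his armour-like back. "What\'s happened to me?" It wasn\'t a dream.'
words = WORD_RE.findall(sample)
print(words)
```

Accented or non-English letters would be split apart by this pattern, so it is only suitable for text like this one.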


In [17]:
# Sometimes text data contains non-printable characters. We can use a similar approach to filter out all
# non-printable characters by selecting the inverse of the string.printable constant.
re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', w) for w in text.split()]  # applied to the whitespace-split words
print(result[:100])

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']
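
The sample text contains no non-printable characters, so the filter changes nothing above. A small sketch on a made-up string shows what it actually removes. Note that `string.printable` is ASCII-only, so legitimate accented letters are stripped as well:

```python
import re
import string

re_print = re.compile('[^%s]' % re.escape(string.printable))

# Made-up input: a byte-order mark, an accented letter, and a NUL character
dirty = '\ufeffThe caf\u00e9 bill\x00'
cleaned = re_print.sub('', dirty)
print(repr(cleaned))
```

If accented characters matter for your task, this filter is too aggressive and a Unicode-aware approach would be needed instead.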


#### 3.5. Normalizing case

In [21]:
# Convert to lower case. First split by whitespace; you can also remove punctuation if needed.
words = [word.lower() for word in text.split()]
print(words[:100])

['one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"what\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'it', "wasn't", 'a', 'dream.', 'his', 'room,', 'a', 'proper', 'human']


**Note** Cleaning text is hard, problem specific, and full of tradeoffs. Remember, simple is better: simpler text data, simpler models, smaller vocabularies.

#### 4. Tokenization and cleaning with NLTK

In [23]:
# Using the NLTK library
!pip3 install -U nltk
# Load and download nltk data for the library.
import nltk
nltk.download()

# or from/for command line
# !python3 -m nltk.downloader all

Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Downloading nltk-3.4.5.zip (1.5 MB)
     |████████████████████████████████| 1.5 MB 429 kB/s eta 0:00:01
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.4.5-py3-none-any.whl size=1449907 sha256=a7aefe0d0130c96c7b364393a84087b0e0aa7e68841587b3670e21fde7d8771f
  Stored in directory: /home/michael/.cache/pip/wheels/48/8b/7f/473521e0c731c6566d631b281f323842bbda9bd819eb9a3ead
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.4.5
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

#### 4.1. Split into sentences

In [24]:
# Some modeling tasks, such as Word2Vec, prefer input to be in the form of paragraphs or sentences.
# You could split into sentences, split each sentence into words, and save each sentence to file.
from nltk import sent_tokenize
# Split into sentences
sentences = sent_tokenize(text)
print(sentences[0])

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.


Notice the wrapping still exists in the sentences even though they have been split.
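
If the artificial line wrapping gets in the way, it can be undone before or after sentence splitting. A minimal sketch, assuming paragraphs are separated by blank lines (`unwrap` is a hypothetical helper):

```python
import re


def unwrap(text):
    """Split on blank lines, then collapse all whitespace inside each paragraph."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


wrapped = (
    "One morning, when Gregor Samsa woke from troubled dreams, he found\n"
    "himself transformed in his bed into a horrible vermin.\n"
    "\n"
    "\"What's happened to me?\" he thought."
)
paragraphs = unwrap(wrapped)
print(paragraphs[0])
```

This also collapses the double spaces after periods that appear in the source text.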

#### 4.2. Split into words

In [25]:
# NLTK splits based on white space and punctuation.
from nltk.tokenize import word_tokenize
# split into words
tokens = word_tokenize(text)
print(tokens[:100])

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '.', '``', 'What', "'s", 'happened', 'to']


**Notice** Commas, periods, and other punctuation marks are taken as separate tokens. Contractions are also split. We can now decide what we want to filter out specifically.

#### 4.3. Filter out Punctuation

In [26]:
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']


Notice that commas and periods were filtered out, as were hyphenated descriptions like armour-like and contraction fragments like 's.

#### 4.4. Filter out Stop Words

If dealing with document classification, it may make more sense to remove stopwords. Note that this is problem specific.

In [27]:
from nltk.corpus import stopwords
# Select stopwords to use
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Notice they are all lower case, and that contractions such as "you're" keep their apostrophe. You can compare your tokens to the stop words and filter them out.

**PS**
Now let's create a pipeline for the text preparation.

    1. Load the raw text.
    2. Split into tokens.
    3. Convert to lowercase.
    4. Remove punctuation.
    5. Filter out remaining tokens that are not alphabetic.
    6. Filter out tokens that are stop words.

In [28]:
# Load modules 
import string, re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [29]:
# Load data
filename = 'Data/metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

In [30]:
# split into words
tokens = word_tokenize(text)
# Convert to lower case
tokens = [w.lower() for w in tokens]
# Prepare regex for character filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# Remove punctuation from each word
stripped = [re_punc.sub('', w) for w in tokens]
# Remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# Filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armourlike', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'nt', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards', 'viewer']


**Note** There is still a lot more that can be done above, but for now we'll just wrap it all in a function for reuse.

In [32]:
def text_prep(text):
    tokens = word_tokenize(text)
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    stop_words = set(stopwords.words('english')) 
    tokens = [w.lower() for w in tokens]    
    stripped = [re_punc.sub('', w) for w in tokens]
    words = [word for word in stripped if word.isalpha()]    
    words = [w for w in words if w not in stop_words]
    return words

In [31]:
import time

In [33]:
start = time.time()
h = text_prep(text)
end = time.time()
print(end-start)

0.12821173667907715
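
One leftover worth noting in the pipeline output is the stray token 'nt', which comes from a contraction fragment having its apostrophe stripped. A possible refinement is to drop such fragments before removing punctuation. The sketch below uses a hand-written token list shaped like the word_tokenize output above and a tiny made-up stop word set; both are assumptions rather than a captured run:

```python
import re
import string

# Hand-written tokens shaped like word_tokenize output (assumption)
tokens = ['It', 'was', "n't", 'a', 'dream', '.', '``', 'What', "'s", 'happened']
# Tiny illustrative stop word set, not the NLTK list
stop_words = {'it', 'was', 'a', 'what'}

re_punc = re.compile('[%s]' % re.escape(string.punctuation))

tokens = [t.lower() for t in tokens]
# Drop contraction fragments such as "n't" and "'s" before stripping punctuation
tokens = [t for t in tokens if "'" not in t]
stripped = [re_punc.sub('', t) for t in tokens]
words = [w for w in stripped if w.isalpha() and w not in stop_words]
print(words)
```

Whether dropping "n't" is acceptable depends on the task, since it discards negation.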


#### 4.5. Stem Words

This is the process of reducing words to their root or base form. Document classification techniques may benefit from stemming, which both reduces the vocabulary and focuses on the sense or sentiment of a document rather than deeper meaning.

In [34]:
# Load module
from nltk.stem.porter import PorterStemmer

In [35]:
# split words
tokens = word_tokenize(text)
# Stem words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

['one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'He', 'lay', 'on', 'hi', 'armour-lik', 'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', ',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg', ',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about', 'helplessli', 'as', 'he', 'look', '.', '``', 'what', "'s", 'happen', 'to']


**Note** There are other algorithms and tools in NLTK for stemming and lemmatization if you need to explore further.

#### 5. Additional Text Cleaning Considerations

The data used here was already fairly clean, and we still had to do some manual cleaning before using it. Your own data may pose more problems. Here are some further considerations when cleaning text:

    1. Handling large documents and large collections of text documents that do not fit into memory.
    2. Extracting text from markup like HTML, PDF, or other structured document formats.
    3. Transliteration of characters from other languages into English.
    4. Decoding text into Unicode and normalizing it to a consistent form (e.g. NFC), stored in an encoding such as UTF-8.
    5. Handling of domain specific words, phrases, and acronyms.
    6. Handling or removing numbers, such as dates and amounts.
    7. Locating and correcting common typos and misspellings.
    8. Much more...
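
Items 3 and 4 in the list above can be approached with the standard library's `unicodedata` module. A minimal sketch, where `to_nfc` and `strip_accents` are hypothetical helper names and accent stripping is only a crude form of transliteration:

```python
import unicodedata


def to_nfc(text):
    """Normalize to NFC so equivalent character sequences compare equal."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text):
    """Decompose characters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)
print(to_nfc(composed) == to_nfc(decomposed))
print(strip_accents("caf\u00e9 na\u00efve"))
```

Scripts that do not decompose into Latin letters plus accents need a dedicated transliteration tool instead.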

We can see, then, that getting truly clean text is impossible; what we really hope to achieve is the best we can given the time, resources, and knowledge we have.

The idea of *clean* is defined by the specific task or concern of your project.

## Now let's see how to Prepare Text Data with scikit-learn

The process of data preparation involves getting the data into a format that a deep learning or machine learning model can interpret. Thus, after cleaning (i.e. removing punctuation, creating tokens, etc.), we have to encode the words as integers or floating point values for use as input to an algorithm. This process is called feature extraction or vectorization.

#### BOW - Bag of Words Model.

This model is simple in that it throws away all other information in the words, such as their order, and focuses on the occurrence of words in a document. This can be done by assigning each word a unique number.
Any document we see can then be encoded as a fixed-length vector with the length of the vocabulary of known words.

The three methods below are various bag of words methods we can use.

#### 1. CountVectorizer

In [36]:
# Import modules
from sklearn.feature_extraction.text import CountVectorizer

In [37]:
# Create a sample text
text = ["The quick brown fox jumped over the lazy dog."]
# Create the countvectorizer instance
vectorizer = CountVectorizer()
# Tokenize and build vocabulary
vectorizer.fit(text)
# Summarize
print(vectorizer.vocabulary_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


In [39]:
# Encode document
vectors = vectorizer.transform(text)
# Summarize encoded vector
print(vectors.shape)
print(type(vectors))
print(vectors.toarray())

(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


We can apply the fitted vectorizer to new text, but note that words not in the original vocabulary are ignored, and vocabulary words that are absent from the new text get a count of 0.

In [40]:
# Encode another doc
text2 = ["the puppy"]
vector2 = vectorizer.transform(text2)
print(vector2.toarray())

[[0 0 0 0 0 0 0 1]]


This is called Word count vectorization.
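
To make the encoding concrete, here is a pure-Python sketch of the same idea. It is not scikit-learn's implementation; the token pattern is an assumption that mimics the vectorizer's default behaviour (lower-casing, tokens of two or more word characters), and the helper names are made up.

```python
import re
from collections import Counter


def tokenize(doc):
    # Lower-case and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


def fit_vocabulary(docs):
    # Assign each known word an index in alphabetical order
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}


def encode(doc, vocabulary):
    # Count known words; unknown words are simply ignored
    counts = Counter(tokenize(doc))
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = n
    return vector


vocab = fit_vocabulary(["The quick brown fox jumped over the lazy dog."])
print(encode("The quick brown fox jumped over the lazy dog.", vocab))
print(encode("the puppy", vocab))
```

For these two inputs the sketch gives the same counts as the CountVectorizer output above.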

#### 2. TfidfVectorizer

Word counts are a good starting point but very basic. One issue is that some words, like *the*, will appear many times, and their large counts will not be very meaningful in the encoded vectors.

**Term Frequency:** This summarizes how often a given word appears within a document.
    
**Inverse Document Frequency:** This downscales words that appear a lot across documents.

TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

In [41]:
# import module
from sklearn.feature_extraction.text import TfidfVectorizer
# List of text documents
text = ["The quick brown fox jumped over the lazy dog.", 
        "The dog.", 
        "The fox"]
# Create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# Summarize
print(vectorizer.vocabulary_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


In [42]:
print(vectorizer.idf_)

[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]


In [44]:
# Encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]


The inverse document frequency is calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word, *the*.

The final scorings for words in the vocab are normalized to values between 0 and 1 and the encoded document vectors can then be used directly with most ML algorithms.
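
The numbers above can be reproduced by hand. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what the vectorizer's defaults appear to use; the tokenizer is a simplified stand-in.

```python
import math
import re
from collections import Counter

docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]


def tokenize(doc):
    return re.findall(r"\b\w\w+\b", doc.lower())


vocab = sorted({t for d in docs for t in tokenize(d)})
n_docs = len(docs)
# Document frequency: in how many documents each word appears
df = Counter(t for d in docs for t in set(tokenize(d)))
# Smoothed inverse document frequency (assumed formula)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_vector(doc):
    tf = Counter(tokenize(doc))
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    # Scale to unit length so every value falls between 0 and 1
    return [x / norm for x in raw] if norm else raw


vector = tfidf_vector(docs[0])
print([round(idf[t], 8) for t in vocab])
print([round(x, 8) for x in vector])
```

*the* appears in all three documents, so its idf is ln(4/4) + 1 = 1.0, while a word seen in only one document gets ln(4/2) + 1.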

#### 3. HashingVectorizer

A limitation of the above two methods is that the vocabulary can become very large, which requires large vectors for encoding documents, imposes large memory requirements, and slows down algorithms.

A workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose a fixed-length vector of arbitrary length. The only downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word; this doesn't matter for most ML tasks.

In [45]:
# import modules 
from sklearn.feature_extraction.text import HashingVectorizer
# List of text documents
text = ["The quick fox jumped over the lazy dog."]
# Create the transform
# 20 is the arbitrary fixed vector length we are choosing. It corresponds to the range of the
# hash function; small values (like 20) may result in hash collisions.
# There are heuristics for picking a hash length and probability of collision based on an
# estimated vocabulary size (e.g. a load factor of 75%).
vectorizer = HashingVectorizer(n_features=20)

In [46]:
# Encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 20)
[[ 0.          0.          0.          0.          0.          0.35355339
   0.         -0.35355339  0.          0.          0.          0.35355339
   0.          0.          0.         -0.35355339  0.          0.
  -0.70710678  0.        ]]
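
The idea behind the encoding can be sketched in a few lines of plain Python. This is an illustration only: it uses md5 for the hash and raw counts, whereas HashingVectorizer uses a different, signed hash and normalizes the vector by default (hence the negative and fractional values above). The helper name `hashed_counts` is made up.

```python
import hashlib
import re


def hashed_counts(doc, n_features=20):
    """Count words into a fixed number of buckets chosen by a hash."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_features   # no vocabulary lookup needed
        vector[bucket] += 1
    return vector


vector = hashed_counts("The quick fox jumped over the lazy dog.")
print(len(vector), sum(vector))
```

Two different words can land in the same bucket, which is the collision risk mentioned in the comment above.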


## How to Prepare Text Data with Keras

#### 1. Split words with text_to_word_sequence

This function does these three things:
    
    1. Splits words by space.
    2. Filters out punctuation.
    3. Converts text to lowercase (lower=True).

In [48]:
# import modules
from keras.preprocessing.text import text_to_word_sequence
# Define the document
text = 'The quick brown fox jumped over the lazy dog'
# Tokenize the document
result = text_to_word_sequence(text)
print(result)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']


#### 2. Encoding with one_hot

In [65]:
# Define the doc
text = 'The quick brown fox jumped over the lazy dog.'
# Estimate the size of the vocab
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

8


In [66]:
# Import module
from keras.preprocessing.text import one_hot
# Integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

[5, 6, 5, 8, 2, 6, 5, 8, 5]
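
The repeated integers in the output above are hash collisions: different words share the same number. A back-of-the-envelope estimate shows why this is expected with so few buckets. The sketch assumes an idealised hash where every bucket is equally likely, and `prob_no_collision` is a hypothetical helper:

```python
def prob_no_collision(n_words, n_buckets):
    """Probability that n_words distinct words all land in different buckets."""
    p = 1.0
    for i in range(n_words):
        # The i-th word must avoid the i buckets already taken
        p *= max(0, n_buckets - i) / n_buckets
    return p


for buckets in (10, 100, 1000):
    print(buckets, round(prob_no_collision(8, buckets), 4))
```

With 8 words and 10 buckets, the chance of avoiding every collision is under 2%, so a much larger hash space is needed when collisions matter.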


#### 3. Hash Encoding with hashing_trick

As seen above, to avoid keeping track of a vocabulary, we can use the one-way hash function. This is faster and requires less memory.

This method allows you to specify the hash function as either hash or another function, such as the built-in md5 function or your own.

In [67]:
# Import modules
from keras.preprocessing.text import hashing_trick
# define text
print(text)

The quick brown fox jumped over the lazy dog.


In [68]:
# Estimate the size of the vocab
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

8


In [69]:
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

[6, 4, 1, 2, 7, 5, 6, 2, 6]


#### 4. Tokenizer API

The Tokenizer API wraps most of the above: you fit it once on your data and then reuse it to encode documents. Because it can be reused, it is the encouraged approach for large projects.

In [70]:
# Import modules
from keras.preprocessing.text import Tokenizer
# Define documents
docs = ['Well done!', 
        'Good work', 
        'Great effort', 
        'Nice work', 
        'Excellent!']
# Create the tokenizer
tokenz = Tokenizer()
# Fit the tokenizer on the document
tokenz.fit_on_texts(docs)

The tokenizer, once fit, provides 4 attributes that can be used to query what has been learned about the documents.

    1. word_counts: A dictionary of words and their counts.
    2. word_docs: A dictionary of words and how many documents each appeared in.
    3. word_index: A dictionary of words and their uniquely assigned integers.
    4. document_count: An integer count of the total number of documents that were used to fit the Tokenizer.

In [71]:
# Summarize learning
print(tokenz.word_counts)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])


In [72]:
print(tokenz.word_docs)

defaultdict(<class 'int'>, {'well': 1, 'done': 1, 'good': 1, 'work': 2, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1})


In [73]:
print(tokenz.word_index)

{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}


In [74]:
print(tokenz.document_count)

5


Once the Tokenizer has been fit on training data, it can be used to encode documents in the train or test datasets.

Modes supported by the texts_to_matrix() function on the tokenizer include:

    1. binary: Whether or not each word is present in the doc. Default.
    2. count: The count of each word in the doc.
    3. tfidf: The term frequency-inverse document frequency (TF-IDF) scoring for each word in the doc.
    4. freq: The frequency of each word as a ratio of words within each doc.

In [75]:
# Encode documents as a matrix of word counts
encoded_docs = tokenz.texts_to_matrix(docs, mode='count')
print(encoded_docs)

[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
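
Each row above has 9 columns for 8 words because indices start at 1, leaving column 0 unused. The following pure-Python sketch reproduces the word_index and count matrix shown for these documents. It is a simplified stand-in for what the Tokenizer does, not its actual implementation, and the helper names are made up.

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'Nice work', 'Excellent!']


def tokenize(doc):
    # Lower-case and keep runs of letters (drops punctuation)
    return re.findall(r"[a-z]+", doc.lower())


word_counts = Counter(tok for doc in docs for tok in tokenize(doc))
# Most frequent words first; ties keep first-seen order (sorted is stable)
ordered = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {w: i for i, w in enumerate(ordered, start=1)}


def count_matrix(documents):
    rows = []
    for doc in documents:
        row = [0] * (len(word_index) + 1)   # column 0 stays empty
        for tok in tokenize(doc):
            if tok in word_index:
                row[word_index[tok]] += 1
        rows.append(row)
    return rows


matrix = count_matrix(docs)
print(word_index)
for row in matrix:
    print(row)
```

Because 'work' is the most frequent word, it receives index 1 and shows up in column 1 of the second and fourth rows.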


We will use this method to prepare text for word embeddings as we progress.