# Text Preprocessing

In this notebook we explore techniques to clean and convert text features into numerical features that machine learning algorithms can work with. We will implement and explore the following.

1. Common text pre-processing
2. Lexicon based text processing
3. Feature Extraction - Bag of Words
4. Putting it all together

# 1. Common text pre-processing

In this section, we will do some general purpose text cleaning.

In [1]:
text = "This is a message to be cleaned. It may involve some things like: <br>, ?, :, ', '' adjacent spaces and tabs"

In [2]:
# let's first lowercase our text completely.

text = text.lower()
print(text)

this is a message to be cleaned. it may involve some things like: <br>, ?, :, ', '' adjacent spaces and tabs


In [3]:
# Now let's get rid of leading/trailing whitespace.
# Note - to remove it from only the left or the right side
# we can use lstrip or rstrip. Here we strip both sides.

text = text.strip()
print(text)

this is a message to be cleaned. it may involve some things like: <br>, ?, :, ', '' adjacent spaces and tabs
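
To make the difference between the three methods concrete, here is a small sketch. The `sample` string is a made-up value chosen so that the whitespace is visible through `repr`.

```python
# Hypothetical sample with a leading tab and trailing spaces.
sample = "\t  padded message   "

both_sides = sample.strip()    # removes whitespace on both sides
left_only = sample.lstrip()    # removes whitespace on the left only
right_only = sample.rstrip()   # removes whitespace on the right only

# repr() makes the remaining whitespace visible.
print(repr(both_sides))
print(repr(left_only))
print(repr(right_only))
```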


In [11]:
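import re, string

# Remove HTML tags (non-greedy match between '<' and '>')
text = re.compile('<.*?>').sub('', text)
# Replace every punctuation character with a space
text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)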
print(text)

this is a message to be cleaned it may involve some things like  br          adjacent spaces and tabs
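
The `?` in `<.*?>` matters: it makes the match non-greedy, so each tag is matched on its own. The sketch below compares it with the greedy `<.*>` on a made-up `sample` string and then applies the same punctuation pattern as above.

```python
import re
import string

# Hypothetical sample containing two HTML tags and some punctuation.
sample = "great <b>value</b>, works well!"

# Non-greedy: each tag is matched and removed separately.
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: a single match runs from the first '<' to the last '>',
# so the text between the tags is lost as well.
greedy = re.sub(r"<.*>", "", sample)

# Character class built from every ASCII punctuation character.
punctuation_pattern = re.compile("[%s]" % re.escape(string.punctuation))
no_punctuation = punctuation_pattern.sub(" ", non_greedy)

print(non_greedy)
print(greedy)
print(repr(no_punctuation))
```

Replacing punctuation with a space (rather than an empty string) keeps neighbouring words apart, at the cost of extra spaces that we collapse in the next step.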


In [12]:
# Remove extra spaces and tabs

import re

text = re.sub(r'\s+', ' ', text)
print(text)

this is a message to be cleaned it may involve some things like br adjacent spaces and tabs
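
Note that `\s+` collapses runs of whitespace but leaves a single space at either end if the text started or ended with whitespace. As a sketch, on a made-up `sample` string, we can compare it with the common `split`/`join` idiom, which also trims the ends.

```python
import re

# Hypothetical sample with repeated spaces, a tab and a newline.
sample = "  too   many\tspaces\n and tabs  "

# Each run of whitespace becomes one space; the edges keep one space each.
collapsed = re.sub(r"\s+", " ", sample)

# split() without arguments splits on any whitespace and drops empty strings,
# so joining the pieces also removes leading/trailing whitespace.
split_joined = " ".join(sample.split())

print(repr(collapsed))
print(repr(split_joined))
```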


# 2. Lexicon based text processing

We saw some general purpose text pre-processing methods. Lexicon-based methods are typically used to normalize the sentences in our dataset; we will later use these normalized sentences for feature extraction.
By normalizing, we mean putting the words in each sentence into a common form so that similarities (if any) between sentences become easier to detect.

<b>Stop word removal</b> : Some words occur very frequently in our sentences and don't contribute much to their overall meaning. We filter them out using a list of stop words.

In [15]:
stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and", "br"]

filtered_sentence = []
words = text.split(" ")
print(words)
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
text = " ".join(filtered_sentence)
print(text)


['message', 'be', 'cleaned', 'may', 'involve', 'some', 'things', 'like', 'br', 'adjacent', 'spaces', 'tabs']
message be cleaned may involve some things like adjacent spaces tabs
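
The same filtering can be written more compactly. The sketch below is one possible variant, not the only way to do it: it stores the stop words in a set (membership tests on a set do not scan the whole collection) and uses a list comprehension. The `sample` sentence is a made-up value.

```python
# Same stop words as above, stored in a set for fast membership tests.
stop_words = {"a", "an", "the", "this", "that", "is", "it", "to", "and", "br"}

# Hypothetical, already cleaned sentence.
sample = "this is a message to be cleaned"

# split() without arguments also discards empty strings.
kept = [w for w in sample.split() if w not in stop_words]
result = " ".join(kept)

print(result)
```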


<b>Stemming</b> : Stemming is a rule-based system to <b>convert words into their root form</b>.
It removes suffixes from words, so the result is not always a dictionary word. This helps us enhance similarities (if any) between sentences.

Example:

"jumping", "jumped" -> "jump"
"cars" -> "car"

In [17]:
# we use the nltk library for stemming (Natural Language Toolkit)

import nltk
from nltk.stem import SnowballStemmer
# Initialize the stemmer
ss = SnowballStemmer("english")
stemmed_sentence = []
words = text.split(" ")
for w in words:
    stemmed_sentence.append(ss.stem(w))
text = " ".join(stemmed_sentence)
print(text)

messag be clean may involv some thing like adjac space tab
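
To see what "rule based" means in practice, here is a deliberately tiny suffix stripper. It is a toy with three made-up rules, not the Snowball algorithm, which applies many more rules and conditions (that is how it arrives at stems such as "messag" above). The names `SUFFIX_RULES`, `toy_stem` and `min_stem_length` are hypothetical.

```python
# Toy suffix rules, checked in order. This is NOT the Snowball algorithm.
SUFFIX_RULES = ["ing", "ed", "s"]


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    # No rule applied: return the word unchanged.
    return word


words = ["jumping", "jumped", "cars", "is"]
stems = [toy_stem(w) for w in words]
print(stems)
```

The minimum stem length is there so that short words such as "is" are left alone instead of being cut down to a single letter.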


# 3. Feature Extraction - Bag of Words

The method is quite simple. First we apply some common pre-processing methods, and then we apply some lexicon based transformations. After that, we convert our text data into numerical data with the Bag of Words (BoW) representation.

Bag of Words (BoW) : A modeling technique to convert text information into numerical representation.
Machine Learning models expect numerical or categorical values as input and won't work with raw text data.

Steps :
1. Create vocabulary of known words.
2. Measure presence of the known words in sentences.
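
The two steps above can be sketched in plain Python. This is a minimal illustration of the idea, assuming whitespace tokenization and an alphabetically sorted vocabulary; it is not how any particular library implements it, and the helper name `to_binary_vector` is made up.

```python
sentences = [
    "this is the first document",
    "this is the second document",
    "and the third one",
]

# Step 1: create the vocabulary of known words.
# Sorting gives every word a fixed column index.
unique_words = sorted({w for s in sentences for w in s.split()})
vocabulary = {word: index for index, word in enumerate(unique_words)}


# Step 2: measure the presence of the known words in a sentence.
def to_binary_vector(sentence, vocabulary):
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector


vectors = [to_binary_vector(s, vocabulary) for s in sentences]
print(vocabulary)
for v in vectors:
    print(v)
```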

We will use the sklearn library's Bag of Words implementation, CountVectorizer.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

countVectorizer = CountVectorizer(binary=True)

sentences = [
    "This is the first document",
    "This is the second document",
    "And the third one",
    "Is this the first document"
]

X = countVectorizer.fit_transform(sentences)

print(countVectorizer.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


Each number next to a word is that word's column index in the vocabulary (from 0 to 8 in this case).

Note : With its default settings, CountVectorizer lowercases the text and drops punctuation while tokenizing, but it doesn't apply the other pre-processing steps we discussed, such as removing HTML tags.
Lexicon-based methods are not applied automatically either, so we need to call them ourselves before feature extraction.
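
We can imitate the default tokenization with the `re` module. The pattern below is CountVectorizer's documented default `token_pattern`; the rest is only a sketch of the behaviour on a made-up `sample` string, not a call into sklearn itself.

```python
import re

# Default token_pattern of CountVectorizer:
# a token is a run of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical sample with punctuation, capitals and one-letter words.
sample = "I liked it, a LOT!"

# Lowercasing mirrors the vectorizer's default lowercase=True.
tokens = re.findall(TOKEN_PATTERN, sample.lower())
print(tokens)
```

Besides punctuation, single-character words such as "i" and "a" are dropped, because the pattern requires at least two word characters.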

In [20]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
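
To read a row of this matrix, we can sort the vocabulary by index to recover the column order. The sketch below copies the vocabulary and the second row from the outputs above; the variable names are our own.

```python
# Vocabulary as printed by countVectorizer.vocabulary_ above.
vocabulary = {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}

# Sorting the words by their index gives the column order of the matrix.
columns = sorted(vocabulary, key=vocabulary.get)

# Second row of X.toarray(), i.e. "This is the second document".
row = [0, 1, 0, 1, 0, 1, 1, 0, 1]

# Keep the words whose column holds a 1.
present = [word for word, flag in zip(columns, row) if flag]

print(columns)
print(present)
```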


So, what happens when we encounter a new word during prediction?

New words will be skipped.
This usually happens when we are making predictions. For our test and validation data, we need to use the .transform() function instead of .fit_transform(), so that the vocabulary learned from the training data stays fixed.
This simulates a real-time prediction case, where we cannot re-train the model every time we receive new words.

In [21]:
test_sentences = [
    "this document has some new words",
    "this one is new too"
]

count_vectors = countVectorizer.transform(test_sentences)
print(count_vectors.toarray())

[[0 1 0 0 0 0 0 0 1]
 [0 0 0 1 1 0 0 0 1]]
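
The skipping behaviour is easy to reproduce by hand. The following sketch assumes simple whitespace tokenization and uses a made-up `transform` helper that also reports which words were ignored; it imitates the idea and is not sklearn's implementation.

```python
# Vocabulary learned from the training sentences (copied from the output above).
vocabulary = {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}


def transform(sentence, vocabulary):
    """Build a binary vector with a fixed vocabulary; collect unseen words."""
    vector = [0] * len(vocabulary)
    unseen = []
    for word in sentence.lower().split():
        index = vocabulary.get(word)
        if index is None:
            # Not in the vocabulary: there is no column for it, so skip it.
            unseen.append(word)
        else:
            vector[index] = 1
    return vector, unseen


vector, unseen = transform("this document has some new words", vocabulary)
print(vector)
print(unseen)
```

The vector has the same length as during training, which is what a model trained on those columns expects.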


# 4. Putting it all together

Let's have a full example here. We will apply everything discussed earlier.

In [22]:
# prepare cleaning functions

import re, string
import nltk
from nltk.stem import SnowballStemmer

stop_words = ["a", "an", "the", "this", "that", "is", "it", "to","and"]
stemmer = SnowballStemmer('english')

def preProcessText(text):
    
    # lowercase and strip leading/trailing white space
    text = text.lower().strip()
    
    # remove HTML tags
    text = re.compile('<.*?>').sub('', text)
    
    # replace punctuation with spaces
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    
    # remove extra white spaces
    text = re.sub(r'\s+', ' ', text)
    
    return text

def lexiconProcess(text, stop_words, stemmer):
    filtered_sentence = []
    words = text.split(" ")
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(stemmer.stem(w))
    text = " ".join(filtered_sentence)
    
    return text

def cleanSentence(text, stop_words, stemmer):
    return lexiconProcess(preProcessText(text), stop_words, stemmer)


In [24]:
# prepare vectorizer

from sklearn.feature_extraction.text import CountVectorizer

# can also limit the vocabulary size here with, say, max_features=50
textVectorizer = CountVectorizer(binary=True)


In [32]:
# clean and vectorize a text feature with four samples

text_feature = [
    "I liked the material, color and overall how it looks.<br /><br />",
    "Worked okay fist two times I used it, but third time burned his face",
    "I am not sure about this product",
    "I never thought I would pay so much for a hair dryer"

]
print(len(text_feature))

# clean up the text

text_feature_cleaned = [cleanSentence(
    item, stop_words, stemmer) for item in text_feature]
print(text_feature_cleaned)

# Vectorize the cleaned text

text_feature_vectorized = textVectorizer.fit_transform(text_feature_cleaned)

print('Vocabulary:\n', textVectorizer.vocabulary_)
print('Bag of words Binary features:\n', text_feature_vectorized.toarray())
print(text_feature_vectorized.shape)

4
['i like materi color overal how look ', 'work okay fist two time i use but third time burn his face', 'i am not sure about product', 'i never thought i would pay so much for hair dryer']
Vocabulary:
 {'like': 12, 'materi': 14, 'color': 4, 'overal': 19, 'how': 11, 'look': 13, 'work': 29, 'okay': 18, 'fist': 7, 'two': 27, 'time': 26, 'use': 28, 'but': 3, 'third': 24, 'burn': 2, 'his': 10, 'face': 6, 'am': 1, 'not': 17, 'sure': 23, 'about': 0, 'product': 21, 'never': 16, 'thought': 25, 'would': 30, 'pay': 20, 'so': 22, 'much': 15, 'for': 8, 'hair': 9, 'dryer': 5}
Bag of words Binary features:
 [[0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0]
 [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1]]
(4, 31)
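
One detail in the cleaned output is worth a second look: the first sentence ends with a trailing space. This happens because we strip before replacing punctuation, so the final "." turns into a space that is never trimmed. CountVectorizer ignores it, so the features are unaffected, but if we want tidy strings we can strip last. The function below is a hypothetical variant of preProcessText, shown only as a sketch.

```python
import re
import string

PUNCTUATION = re.compile("[%s]" % re.escape(string.punctuation))


def pre_process_text_strip_last(text):
    """Hypothetical variant of preProcessText that strips as the final step."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)   # remove HTML tags
    text = PUNCTUATION.sub(" ", text)   # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)    # collapse repeated whitespace
    return text.strip()                 # trim the edges last


sample = "I liked the material, color and overall how it looks.<br /><br />"
cleaned = pre_process_text_strip_last(sample)
print(repr(cleaned))
```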
