### 1. Common text pre-processing examples
In this section, we will do some general-purpose text cleaning. The cleaning methods below can be extended depending on the application.

In [47]:
text = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "
print(text)

   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


Let's first lowercase our text. 

In [48]:
text = text.lower()
print(text)

   this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


We can get rid of leading/trailing whitespace with the following:

In [49]:
text = text.strip()
print(text)

this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .


Remove HTML tags/markups:

In [50]:
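import re

def striphtml(data):
    # Non-greedy match: removes each "<...>" tag separately
    p = re.compile(r'<.*?>')
    return p.sub('', data)

# Note: the result is only displayed here, not assigned back to `text`,
# so the cells that follow still see "br" in the text.
# Inside this single-quoted literal, '' ends and restarts the string, so the two quotes vanish.
striphtml('this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs')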

'this is a message to be cleaned. it may involve some things like: , ?, :,   adjacent spaces and tabs'
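The `?` in `<.*?>` makes the match non-greedy, so each tag is removed on its own. A small sketch of the difference follows; the sample string and variable names are made up for illustration and are not part of the notebook.

```python
import re

sample = "some <b>bold</b> text"  # hypothetical input, for illustration only

# Non-greedy: each match stops at the first ">", so only the tags are removed.
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: one match runs from the first "<" to the last ">",
# which also swallows the text between the tags.
greedy = re.sub(r"<.*>", "", sample)

print(repr(non_greedy))  # 'some bold text'
print(repr(greedy))      # 'some  text'
```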

Replace punctuation with spaces. Depending on the application, punctuation can actually be useful, for example when distinguishing the positive vs. negative meaning of a sentence.

In [51]:
import re, string

text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ',text)
print(text)

this is a message to be cleaned  it may involve some things like   br             adjacent spaces and tabs      
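If some punctuation carries signal for the task (say, `!` and `?` for sentiment), one option is to remove everything except those characters. This is a minimal sketch; the `keep` set and the sample sentence are assumptions chosen for illustration.

```python
import re
import string

keep = "!?"  # assumed: punctuation we want to preserve for this task
# All punctuation characters except the ones we keep
to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
pattern = re.compile("[%s]" % re.escape(to_remove))

sample = "great product, works well!"  # hypothetical input
cleaned = pattern.sub(" ", sample)
print(cleaned)  # great product  works well!
```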


Remove extra spaces and tabs:

In [52]:
# Raw string avoids the invalid "\s" escape warning in newer Python versions
text = re.sub(r'\s+', " ", text)
print(text)

this is a message to be cleaned it may involve some things like br adjacent spaces and tabs 
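The steps of this section can be chained into one helper. The function below is a sketch rather than part of the original notebook: the name `clean_text` is made up, and, unlike the cells above, it applies the HTML-stripping step to the text itself and strips once more at the end to drop the trailing space.

```python
import re
import string

def clean_text(text):
    """Lowercase, strip HTML tags, replace punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)                                 # remove HTML tags
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)  # punctuation -> space
    text = re.sub(r"\s+", " ", text)                                  # collapse spaces/tabs
    return text.strip()                                               # drop leading/trailing space

raw = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "
print(clean_text(raw))
# this is a message to be cleaned it may involve some things like adjacent spaces and tabs
```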


### 2. Lexicon-based text processing examples
We saw some general-purpose text pre-processing methods in the previous section. Lexicon-based methods are usually applied after those common steps. They are used to normalize the sentences in our dataset. By normalization, we mean putting words into a similar format, which also enhances similarities (if any) between sentences.

In [53]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\solharsh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\solharsh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\solharsh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### Stop word removal

There can be words in our sentences that occur very frequently and don't contribute much to the overall meaning. We usually keep a list of these words and remove them from each of our sentences. In this example: "a", "an", "the", "this", "that", "is", "it", "to", "and".


In [55]:
import nltk
from nltk.tokenize import word_tokenize
filtered_sentence = []
# Stop word lists can be adjusted for our problem
stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]
# Tokenize the sentence
words = word_tokenize(text)
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
        
text = " ".join(filtered_sentence)
print(text)

message be cleaned may involve some things like br adjacent spaces tabs
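How the stop word list is chosen matters. General-purpose lists (NLTK's English list, for example) include negations such as "not", and removing those can flip the meaning of a sentence. The sketch below uses plain `str.split()` instead of `word_tokenize` to stay dependency-free; the helper name, the word lists and the sample sentence are assumptions for illustration.

```python
def remove_stop_words(sentence, stop_words):
    """Return the sentence without the tokens listed in stop_words."""
    return " ".join(w for w in sentence.split() if w not in stop_words)

sample = "this is not a good movie"  # hypothetical input

broad_list = {"a", "is", "this", "not"}  # assumed list that includes a negation
adjusted_list = broad_list - {"not"}     # keep "not" for sentiment-style tasks

print(remove_stop_words(sample, broad_list))     # good movie
print(remove_stop_words(sample, adjusted_list))  # not good movie
```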


#### Stemming
Stemming is a rule-based system that converts words into their root form by removing suffixes. This helps us enhance similarities (if any) between sentences.

Example:

"jumping", "jumped" -> "jump"

"cars" -> "car"

In [56]:
from nltk.stem import SnowballStemmer

# Initialize the stemmer
snow = SnowballStemmer("english")

stemmed_sentence = []
# Tokenize
words = word_tokenize(text)
for w in words:
    # Stem the word/token
    stemmed_sentence.append(snow.stem(w))
stemmed_text = " ".join(stemmed_sentence)
print(stemmed_text)


messag be clean may involv some thing like br adjac space tab


We can see above that the stemming operation is NOT perfect. We get mistakes such as "messag", "involv", "adjac". It is a rule-based method that sometimes mistakenly removes suffixes from words. Nevertheless, it runs fast.
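To see why a rule-based stemmer produces non-words, consider a deliberately tiny suffix stripper. This is a toy sketch, not the Snowball algorithm: the suffix list and the minimum stem length are assumptions made up for illustration.

```python
# Toy rules, checked in order; the first matching suffix is removed.
SUFFIXES = ["ing", "ed", "e", "s"]  # assumed rules, far simpler than Snowball

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "jumped", "cars", "message", "involve"]:
    print(w, "->", toy_stem(w))
# jumping -> jump
# jumped -> jump
# cars -> car
# message -> messag
# involve -> involv
```

The rule that drops a trailing "e" lets "involve" and "involving" share the stem "involv", at the cost of producing a non-word.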

#### Lemmatization
If we are not satisfied with the result of stemming, we can use lemmatization instead. It usually requires more work but gives better results. Lemmatization needs to know the correct part-of-speech (POS) tag of each word, such as "noun", "verb", "adjective", etc., and we will use another NLTK function to feed this information to the lemmatizer.
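The part of speech matters because the same surface form can have different lemmas. A toy dictionary lookup illustrates the idea; the tiny lexicon and the function name below are made up for illustration and stand in for WordNet, which the real lemmatizer consults.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma
TOY_LEXICON = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("cleaned", "verb"): "clean",
}

def toy_lemmatize(word, pos="noun"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return TOY_LEXICON.get((word, pos), word)

print(toy_lemmatize("saw", "verb"))      # see
print(toy_lemmatize("saw", "noun"))      # saw
print(toy_lemmatize("cleaned", "verb"))  # clean
print(toy_lemmatize("cleaned"))          # cleaned (no noun entry, so unchanged)
```

The last call is similar in spirit to lemmatizing a verb with the default noun tag: no matching entry is found and the word comes back unchanged.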

In [59]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

In [62]:
# Initialize the lemmatizer
wl = WordNetLemmatizer()
# Helper function that maps NLTK part-of-speech tags to WordNet tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

lemmatized_sentence = []
# Tokenize the sentence
words = word_tokenize(text)
# Get part-of-speech tags as (word, tag) pairs
word_pos_tags = nltk.pos_tag(words)
# Map each tag to its WordNet equivalent and lemmatize the word/token
for word, tag in word_pos_tags:
    lemmatized_sentence.append(wl.lemmatize(word, get_wordnet_pos(tag)))

lemmatized_text = " ".join(lemmatized_sentence)

In [63]:
print(lemmatized_text)

message be clean may involve some thing like br adjacent space tabs


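This looks better than the stemming result: "message", "involve" and "adjacent" are kept as valid words, although "tabs" is left in its plural form.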