## Text Pre-processing using nltk and pyspellchecker

This notebook walks through text preprocessing with NLTK and pyspellchecker.

#### Steps for Text Pre-Processing
 0. Split sentences
 1. Tokenize sentences
 2. Spell Check
 3. Part-of-speech (POS) tagging
 4. Stop-words removal
 5. Lowercase
 6. Removal of non-alphabetic / non-alphanumeric tokens
 7. Stemming and Lemmatization

NLTK is one of the most popular Python libraries for working with text data,
and pyspellchecker is a simple, easy-to-use library for spell checking.

### Setup
 Below are the installation instructions for the libraries required by this notebook.

* `!pip install pandas`
* `!pip install pyspellchecker`
* `!pip install nltk`
* `nltk.download('punkt')`
* `nltk.download('averaged_perceptron_tagger')`
* `nltk.download('stopwords')`
* `nltk.download('wordnet')`

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import pandas as pd

# nltk imports used in preprocessing
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# import for spellcheck
from spellchecker import SpellChecker

### Creating sample data

In [16]:
text_data = ["i want to travel. Hence book my tickets.",
            "please book flight tickets for tomorow morning.",
            "the shwo was vry good. but it was very costly.",
            "let's complete wrk next weekend. So that we can have party afterwards.",
            "my work is allmost complted"]

### Split sentences

In [3]:
# split every text into sentences and flatten the result into one list
text_data = [sentence for text in text_data for sentence in sent_tokenize(text)]
text_data

['i want to travel.',
 'Hence book my tickets.',
 'please book flight tickets for tomorow morning.',
 'the shwo was vry good.',
 'but it was very costly.',
 "let's complete wrk next weekend.",
 'So that we can have party afterwards.',
 'my work is allmost complted']

### Tokenize sentences to words

In [4]:
sentences = [word_tokenize(text) for text in text_data]
print(sentences)

[['i', 'want', 'to', 'travel', '.'], ['Hence', 'book', 'my', 'tickets', '.'], ['please', 'book', 'flight', 'tickets', 'for', 'tomorow', 'morning', '.'], ['the', 'shwo', 'was', 'vry', 'good', '.'], ['but', 'it', 'was', 'very', 'costly', '.'], ['let', "'s", 'complete', 'wrk', 'next', 'weekend', '.'], ['So', 'that', 'we', 'can', 'have', 'party', 'afterwards', '.'], ['my', 'work', 'is', 'allmost', 'complted']]


### Spellcheck

In [5]:
# initialize the pyspellchecker SpellChecker object
spell = SpellChecker()

# find the misspelled words across all sentences
misspelled = spell.unknown([word for sentence in sentences for word in sentence])

# map each misspelled word to its most likely correction
misspelled_dict = {word: spell.correction(word) for word in misspelled}
misspelled_dict

{'allmost': 'almost',
 'shwo': 'show',
 "'s": 'is',
 'vry': 'very',
 'wrk': 'work',
 'complted': 'completed',
 'tomorow': 'tomorrow'}

In [6]:
# replace every occurrence of a misspelled word, in place, in all sentences
for sentence in sentences:
    for i, word in enumerate(sentence):
        if word in misspelled_dict:
            sentence[i] = misspelled_dict[word]
sentences

[['i', 'want', 'to', 'travel', '.'],
 ['Hence', 'book', 'my', 'tickets', '.'],
 ['please', 'book', 'flight', 'tickets', 'for', 'tomorrow', 'morning', '.'],
 ['the', 'show', 'was', 'very', 'good', '.'],
 ['but', 'it', 'was', 'very', 'costly', '.'],
 ['let', 'is', 'complete', 'work', 'next', 'weekend', '.'],
 ['So', 'that', 'we', 'can', 'have', 'party', 'afterwards', '.'],
 ['my', 'work', 'is', 'almost', 'completed']]
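
The replacement above matches tokens exactly, so a capitalised typo such as `Shwo` would be skipped when the correction keys are lowercase. The following is a minimal sketch of a case-insensitive replacement. It assumes a hand-written corrections dictionary rather than calling pyspellchecker, and the helper name `apply_corrections` and the sample data are hypothetical.

```python
def apply_corrections(tokens, corrections):
    """Return a new token list with every known misspelling replaced.

    `corrections` is assumed to map lowercase misspellings to their fixes.
    """
    fixed = []
    for token in tokens:
        replacement = corrections.get(token.lower())
        if replacement is None:
            # unknown to the dictionary: keep the token unchanged
            fixed.append(token)
        elif token[:1].isupper():
            # preserve a leading capital letter
            fixed.append(replacement.capitalize())
        else:
            fixed.append(replacement)
    return fixed


# hypothetical sample data, not produced by the spell checker
sample_corrections = {"shwo": "show", "vry": "very"}
sample_tokens = ["The", "Shwo", "was", "vry", "vry", "good", "."]

corrected = apply_corrections(sample_tokens, sample_corrections)
print(corrected)
```

Because the helper builds a new list, the original tokens stay available for comparison.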

### Part-of-speech (POS) tagging

In [7]:
pos_tagged = [nltk.pos_tag(sentence) for sentence in sentences]
print(pos_tagged)

[[('i', 'NN'), ('want', 'VBP'), ('to', 'TO'), ('travel', 'VB'), ('.', '.')], [('Hence', 'NNP'), ('book', 'NN'), ('my', 'PRP$'), ('tickets', 'NNS'), ('.', '.')], [('please', 'VB'), ('book', 'NN'), ('flight', 'NN'), ('tickets', 'NNS'), ('for', 'IN'), ('tomorrow', 'NN'), ('morning', 'NN'), ('.', '.')], [('the', 'DT'), ('show', 'NN'), ('was', 'VBD'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')], [('but', 'CC'), ('it', 'PRP'), ('was', 'VBD'), ('very', 'RB'), ('costly', 'JJ'), ('.', '.')], [('let', 'NN'), ('is', 'VBZ'), ('complete', 'JJ'), ('work', 'NN'), ('next', 'JJ'), ('weekend', 'NN'), ('.', '.')], [('So', 'RB'), ('that', 'IN'), ('we', 'PRP'), ('can', 'MD'), ('have', 'VB'), ('party', 'NN'), ('afterwards', 'NNS'), ('.', '.')], [('my', 'PRP$'), ('work', 'NN'), ('is', 'VBZ'), ('almost', 'RB'), ('completed', 'VBN')]]


### Stop words removal

In [8]:
# NLTK ships with a predefined list of stop words that we can extend or
# trim to suit the application. Here we keep the wh-question words and
# 'not', because removing them may lose information.

words_not_to_remove = ["not", "where", "why", "how", "what", "who", "which", "when", "whom"]
stopwords_list = [word for word in stopwords.words('english') if word not in words_not_to_remove]

# list.copy() is shallow: the inner token lists are shared with `sentences`,
# so the in-place filtering below also updates `sentences`.
sentences_without_stopwords = sentences.copy()

for sentence in sentences_without_stopwords:
    # slice assignment filters in place and drops every occurrence of a stop word
    sentence[:] = [word for word in sentence if word not in stopwords_list]
            
sentences_without_stopwords

[['want', 'travel', '.'],
 ['Hence', 'book', 'tickets', '.'],
 ['please', 'book', 'flight', 'tickets', 'tomorrow', 'morning', '.'],
 ['show', 'good', '.'],
 ['costly', '.'],
 ['let', 'complete', 'work', 'next', 'weekend', '.'],
 ['So', 'party', 'afterwards', '.'],
 ['work', 'almost', 'completed']]
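
Two details of the cell above are easy to miss: `list.copy()` on a list of lists copies only the outer list, and `list.remove()` deletes only the first matching item. The sketch below illustrates both with plain Python. The small `stop_set` is a stand-in for NLTK's stop-word list, and all names and data here are hypothetical.

```python
tokens = [["i", "want", "to", "travel"],
          ["it", "was", "very", "very", "costly"]]
stop_set = {"i", "to", "it", "was", "very"}  # stand-in for NLTK's stop words

# A shallow copy shares its inner lists with the original.
shallow = tokens.copy()
same_inner_list = shallow[0] is tokens[0]

# list.remove() deletes only the first occurrence of a value.
demo = ["very", "very", "costly"]
demo.remove("very")

# Building new inner lists leaves `tokens` untouched and drops every match.
filtered = [[word for word in sent if word not in stop_set] for sent in tokens]

print(same_inner_list)
print(demo)
print(filtered)
```

Using a set for the stop words also makes each membership test a constant-time lookup.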

### Lowercase sentences

In [9]:
sentences_lowercase = [[word.lower() for word in sent] for sent in sentences_without_stopwords]
sentences_lowercase

[['want', 'travel', '.'],
 ['hence', 'book', 'tickets', '.'],
 ['please', 'book', 'flight', 'tickets', 'tomorrow', 'morning', '.'],
 ['show', 'good', '.'],
 ['costly', '.'],
 ['let', 'complete', 'work', 'next', 'weekend', '.'],
 ['so', 'party', 'afterwards', '.'],
 ['work', 'almost', 'completed']]

### Removal of non-alphabetic / non-alphanumeric tokens

In [10]:
# Keep only purely alphabetic tokens (this drops punctuation and numbers);
# to keep tokens that contain digits as well, use .isalnum() instead of .isalpha()

sentences_alpha = [[word for word in sent if word.isalpha()] for sent in sentences_without_stopwords]
sentences_alpha

[['want', 'travel'],
 ['Hence', 'book', 'tickets'],
 ['please', 'book', 'flight', 'tickets', 'tomorrow', 'morning'],
 ['show', 'good'],
 ['costly'],
 ['let', 'complete', 'work', 'next', 'weekend'],
 ['So', 'party', 'afterwards'],
 ['work', 'almost', 'completed']]
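
To make the difference between the two string methods concrete, here is a small sketch on a hypothetical token list. It uses only built-in `str` methods, and the sample tokens are made up for illustration.

```python
tokens = ["flight", "b2b", "42", "'s", ".", "co-op"]

# isalpha(): every character must be a letter
only_alpha = [t for t in tokens if t.isalpha()]

# isalnum(): every character must be a letter or a digit
alpha_or_digits = [t for t in tokens if t.isalnum()]

print(only_alpha)
print(alpha_or_digits)
```

Neither method keeps tokens with internal punctuation such as `co-op`, so hyphenated words need separate handling if they matter for the task.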

### Stemming and lemmatization

Here are the definitions of stemming and lemmatization from Wikipedia:

> Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form

> Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

In [11]:
# Stemming Example

ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
[ps.stem(w) for w in example_words]

['python', 'python', 'python', 'python', 'pythonli']

In [12]:
# Lemmatization example

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("was", pos="v"))
print(lemmatizer.lemmatize("best", pos="a"))

be
best
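
The lemmatizer needs a part of speech to pick the right lemma, and the POS tagging step earlier already produced Penn Treebank tags. One possible way to connect the two is a small mapping from Penn Treebank tags to the single-letter WordNet tags (`'n'`, `'v'`, `'a'`, `'r'`). The helper name `penn_to_wordnet` is hypothetical, and falling back to noun for all other tags is an assumption of this sketch.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun as fallback)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used for every other tag


# tagged tokens copied from the POS tagging output of this notebook
tagged = [("the", "DT"), ("show", "NN"), ("was", "VBD"),
          ("very", "RB"), ("good", "JJ")]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.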


In [13]:
sentences_stemed = [[ps.stem(word) for word in sent] for sent in sentences_alpha]
sentences_stemed

[['want', 'travel'],
 ['henc', 'book', 'ticket'],
 ['pleas', 'book', 'flight', 'ticket', 'tomorrow', 'morn'],
 ['show', 'good'],
 ['costli'],
 ['let', 'complet', 'work', 'next', 'weekend'],
 ['So', 'parti', 'afterward'],
 ['work', 'almost', 'complet']]

In [14]:
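# Note: the lemmatizer runs on the stems here, which is why 'pleas' becomes 'plea'.
# In practice you would usually apply either stemming or lemmatization
# to sentences_alpha, not both.
sentences_lematized = [[lemmatizer.lemmatize(word) for word in sent] for sent in sentences_stemed]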
sentences_lematized

[['want', 'travel'],
 ['henc', 'book', 'ticket'],
 ['plea', 'book', 'flight', 'ticket', 'tomorrow', 'morn'],
 ['show', 'good'],
 ['costli'],
 ['let', 'complet', 'work', 'next', 'weekend'],
 ['So', 'parti', 'afterward'],
 ['work', 'almost', 'complet']]