# Text preprocessing

📌 The `Natural Language Toolkit (NLTK)` is one of the largest Python libraries for Natural Language Processing (NLP). It covers tasks ranging from rudimentary text preprocessing to vectorized representations of text.

In [1]:
# Loading NLTK
import nltk
nltk.download()  # opens the interactive downloader for choosing corpora and models

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

📌 Tokenization refers to breaking text down into smaller units. It entails splitting paragraphs into sentences and sentences into words, and it is one of the first steps of any NLP pipeline.

Let us look at the two major types of tokenization provided by NLTK, along with one manual method (character tokenization).

### Sentence Tokenizing

In [2]:
from nltk.tokenize import sent_tokenize

In [3]:
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""

In [4]:
tokenized_text = sent_tokenize(text)
print(tokenized_text)

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
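To see why a dedicated sentence tokenizer is useful, the sketch below splits the same text with a naive regular expression. This is an illustrative alternative only, not how `sent_tokenize` works internally. The naive rule breaks after every `.`, `?` or `!`, so it wrongly splits at the abbreviation `Mr.`, which `sent_tokenize` kept intact in the output above.

```python
import re

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""

# Naive rule: split on whitespace that follows '.', '?' or '!'.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```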


### Word Tokenizing

In [5]:
from nltk.tokenize import word_tokenize

In [6]:
tokenized_word=word_tokenize(text)
print(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
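For contrast, here is a minimal sketch of a plain whitespace split using only built-in string methods. It keeps punctuation attached to words (`'Smith,'`, `'today?'`) and leaves `"shouldn't"` whole, whereas `word_tokenize` emits punctuation as separate tokens and splits the contraction into `'should'` and `"n't"`.

```python
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""

# str.split() with no arguments splits on any run of whitespace, including newlines.
whitespace_tokens = text.split()
print(whitespace_tokens)
```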


### Character Tokenizing

In [7]:
tokenized_character = []

In [8]:
for character in text:
    tokenized_character.append(character)

In [9]:
print(tokenized_character)

['H', 'e', 'l', 'l', 'o', ' ', 'M', 'r', '.', ' ', 'S', 'm', 'i', 't', 'h', ',', ' ', 'h', 'o', 'w', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', ' ', 'd', 'o', 'i', 'n', 'g', ' ', 't', 'o', 'd', 'a', 'y', '?', ' ', 'T', 'h', 'e', ' ', 'w', 'e', 'a', 't', 'h', 'e', 'r', ' ', 'i', 's', ' ', 'g', 'r', 'e', 'a', 't', ',', ' ', 'a', 'n', 'd', ' ', 'c', 'i', 't', 'y', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '.', '\n', 'T', 'h', 'e', ' ', 's', 'k', 'y', ' ', 'i', 's', ' ', 'p', 'i', 'n', 'k', 'i', 's', 'h', '-', 'b', 'l', 'u', 'e', '.', ' ', 'Y', 'o', 'u', ' ', 's', 'h', 'o', 'u', 'l', 'd', 'n', "'", 't', ' ', 'e', 'a', 't', ' ', 'c', 'a', 'r', 'd', 'b', 'o', 'a', 'r', 'd']
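The loop above can also be written in one step: a string is an iterable of characters, so `list(text)` yields the same character tokens. A minimal sketch:

```python
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""

# Iterating over a string yields one character at a time,
# including spaces and the newline.
tokenized_character = list(text)
print(tokenized_character[:9])
```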


### Stopwords

📌 Stop words are common words in a language that are often filtered out during NLP tasks because they carry little meaning or contribute minimally to the overall understanding of a text. Examples include "the", "is", "and", and "in".

In [10]:
from nltk.corpus import stopwords

In [11]:
# Load the English stop word list once and inspect its type and size.
english_stopwords = stopwords.words("english")
print(type(english_stopwords))
print(len(english_stopwords))

<class 'list'>
198


In [12]:
stop_words = set(stopwords.words("english"))
print(len(stop_words))
print(stop_words)

198
{'and', 'needn', 'won', 'couldn', 'here', "she'll", 'below', 'up', "it's", 'this', "it'll", 'some', 't', 'very', 'yourself', 'no', 'or', 'just', 'can', 'on', 'who', 'when', 'their', 'few', 'mustn', "don't", "i'll", 'didn', "it'd", "isn't", "she's", 'off', 'so', "wasn't", 'ain', 'a', 'than', 'was', 'my', 'were', 'what', "aren't", 'having', 'isn', 'too', 'we', 'more', 'do', "he's", "we'll", "you'd", 'nor', 'other', 'been', 's', 'same', "that'll", 'down', 'in', 'ma', 'should', 'whom', "i've", 'wasn', "mightn't", 'don', 'any', 've', "didn't", 'yourselves', 'your', 'ourselves', "should've", 'am', 'as', 'shan', 'such', 'i', 'again', 'himself', 'during', 'own', "wouldn't", 'yours', 'you', 'after', 'most', 'at', "they'll", 'between', "doesn't", 'shouldn', "i'd", "hasn't", "you've", 'hadn', 'itself', 'him', 'that', "you're", 'being', 'while', 'myself', 'weren', 'from', 'her', "he'll", 'above', 'wouldn', 'then', 'be', "you'll", 'ours', 'are', 'each', 'his', 'aren', 'me', 'our', 'm', 'until',

### Removing Stopwords

In [13]:
filtered_sent=[]

In [14]:
tokenized_word

['Hello',
 'Mr.',
 'Smith',
 ',',
 'how',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'The',
 'weather',
 'is',
 'great',
 ',',
 'and',
 'city',
 'is',
 'awesome',
 '.',
 'The',
 'sky',
 'is',
 'pinkish-blue',
 '.',
 'You',
 'should',
 "n't",
 'eat',
 'cardboard']

In [15]:
for w in tokenized_word:
    if w not in stop_words:
        filtered_sent.append(w)

In [16]:
print("Tokenized Sentence:", tokenized_word)
print("Tokenized Sentence length:", len(tokenized_word))
print("Filtered Sentence:", filtered_sent)
print("Filtered Sentence length:", len(filtered_sent))

Tokenized Sentence: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
Tokenized Sentence length: 30
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.', 'The', 'sky', 'pinkish-blue', '.', 'You', "n't", 'eat', 'cardboard']
Filtered Sentence length: 21
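Note that the filter above is case-sensitive: NLTK's English stop words are all lowercase, so `'The'` and `'You'` survive. A minimal sketch of a case-insensitive variant follows. To stay self-contained it uses a small hand-picked subset of stop words (an assumption for illustration) rather than the full NLTK list.

```python
# Hypothetical subset of stop words, for illustration only.
stop_words = {"the", "is", "and", "how", "are", "you", "doing", "should"}

tokenized_word = ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
                  'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
                  'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat',
                  'cardboard']

# Lowercase each token before the membership test so "The" matches "the".
filtered_sent = [w for w in tokenized_word if w.lower() not in stop_words]
print(filtered_sent)
print(len(filtered_sent))
```

Storing the stop words in a `set` keeps each membership test cheap regardless of how long the list is.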


### Stemming

📌 Stemming reduces an inflected word to a base form (the stem) by stripping its affixes, following a set of predefined rules.

📌 Stemmers do not always produce semantically meaningful base words (for example, "city" becomes "citi" below), but they are faster and computationally cheaper than lemmatizers.

In [17]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

In [18]:
ps = PorterStemmer()

In [19]:
stemmed_words=[]

In [20]:
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))

In [21]:
print("Filtered Sentence:", filtered_sent)
print("Stemmed Sentence:", stemmed_words)

Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.', 'The', 'sky', 'pinkish-blue', '.', 'You', "n't", 'eat', 'cardboard']
Stemmed Sentence: ['hello', 'mr.', 'smith', ',', 'today', '?', 'the', 'weather', 'great', ',', 'citi', 'awesom', '.', 'the', 'sky', 'pinkish-blu', '.', 'you', "n't", 'eat', 'cardboard']
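The idea of rule-based affix removal can be illustrated with a deliberately tiny suffix stripper. This is a toy sketch with made-up rules (the `SUFFIX_RULES` list and the `toy_stem` function are hypothetical), not the Porter algorithm, which applies several ordered phases with additional conditions.

```python
# Hypothetical (suffix, replacement) rules, tried in order; the first match wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two remaining characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["Cities", "walking", "jumped", "cats", "is"]:
    print(w, "->", toy_stem(w))
```

As with the Porter output above, the result need not be a dictionary word: the toy rules turn "Cities" into "citi".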


### Lemmatization

📌 Lemmatization groups together the inflected forms of the same word so that they map to a single, meaningful base form. This base form is called the lemma.

📌 Lemmatizers are slower and computationally more expensive than stemmers.

In [22]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [23]:
lem = WordNetLemmatizer()

In [24]:
stem = PorterStemmer()

In [25]:
word = "is"

In [26]:
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))

Lemmatized Word: be
Stemmed Word: is
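The second argument to `lem.lemmatize` is a WordNet part-of-speech code; when it is omitted, the lemmatizer treats the word as a noun. When tokens carry Penn Treebank tags (such as those produced by `nltk.pos_tag`), a small helper can translate them into WordNet codes. The function name `penn_to_wordnet` is hypothetical, and the mapping is a common convention rather than part of NLTK.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code ('n', 'v', 'a' or 'r')."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBN, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else is treated as a noun

for tag in ["VBD", "JJ", "RB", "NNP"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The result can then be passed as the second argument, for example `lem.lemmatize(word, penn_to_wordnet(tag))`.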


In [27]:
lemmatize_words = []
stem_words = []

In [28]:
for word in tokenized_word:
    lemmatize_words.append(lem.lemmatize(word, 'v'))
    stem_words.append(stem.stem(word))

In [29]:
lemmatize_words

['Hello',
 'Mr.',
 'Smith',
 ',',
 'how',
 'be',
 'you',
 'do',
 'today',
 '?',
 'The',
 'weather',
 'be',
 'great',
 ',',
 'and',
 'city',
 'be',
 'awesome',
 '.',
 'The',
 'sky',
 'be',
 'pinkish-blue',
 '.',
 'You',
 'should',
 "n't",
 'eat',
 'cardboard']

In [30]:
stem_words

['hello',
 'mr.',
 'smith',
 ',',
 'how',
 'are',
 'you',
 'do',
 'today',
 '?',
 'the',
 'weather',
 'is',
 'great',
 ',',
 'and',
 'citi',
 'is',
 'awesom',
 '.',
 'the',
 'sky',
 'is',
 'pinkish-blu',
 '.',
 'you',
 'should',
 "n't",
 'eat',
 'cardboard']

In [31]:
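# Map each stem to the lemma of the same token.
# A descriptive name avoids shadowing the built-in `dict`.
stem_to_lemma = {}

for stem_word, lemma in zip(stem_words, lemmatize_words):
    stem_to_lemma[stem_word] = lemma

stem_to_lemma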

{'hello': 'Hello',
 'mr.': 'Mr.',
 'smith': 'Smith',
 ',': ',',
 'how': 'how',
 'are': 'be',
 'you': 'You',
 'do': 'do',
 'today': 'today',
 '?': '?',
 'the': 'The',
 'weather': 'weather',
 'is': 'be',
 'great': 'great',
 'and': 'and',
 'citi': 'city',
 'awesom': 'awesome',
 '.': '.',
 'sky': 'sky',
 'pinkish-blu': 'pinkish-blue',
 'should': 'should',
 "n't": "n't",
 'eat': 'eat',
 'cardboard': 'cardboard'}

### POS Tagging

📌 Part-of-speech (POS) tagging assigns a part of speech to each word of a sentence. It is useful because it gives a better syntactic overview of the sentence.

In [33]:
sentence = "Albert Einstein was born in Ulm, Germany in 1879."

In [34]:
tokens = nltk.word_tokenize(sentence)
print(tokens)

['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']


In [35]:
nltk.pos_tag(tokens)

[('Albert', 'NNP'),
 ('Einstein', 'NNP'),
 ('was', 'VBD'),
 ('born', 'VBN'),
 ('in', 'IN'),
 ('Ulm', 'NNP'),
 (',', ','),
 ('Germany', 'NNP'),
 ('in', 'IN'),
 ('1879', 'CD'),
 ('.', '.')]
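The tags follow the Penn Treebank convention. As a reading aid, the sketch below pairs each tagged token from the output above with the meaning of its tag. The `TAG_MEANINGS` dictionary is a hand-written lookup covering only the tags in this sentence, not something provided by NLTK.

```python
# Tagged tokens copied from the output above.
tagged = [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'),
          ('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'),
          ('in', 'IN'), ('1879', 'CD'), ('.', '.')]

# Meanings of the Penn Treebank tags that occur in this sentence.
TAG_MEANINGS = {
    "NNP": "Proper noun, singular",
    "VBD": "Verb, past tense",
    "VBN": "Verb, past participle",
    "IN": "Preposition/subordinating conjunction",
    "CD": "Cardinal number",
}

for token, tag in tagged:
    # The ',' and '.' tags are not in the lookup; label them generically.
    print(f"{token:10} {tag:4} {TAG_MEANINGS.get(tag, 'Punctuation')}")
```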

In [36]:
import pandas as pd

In [None]:
POS_tags = pd.read_csv(r'Data\POS_tags.csv')
POS_tags

| Unnamed: 0 | Tag | Meaning |
|---|---|---|
| 0 | CC | Coordinating conjunction |
| 1 | CD | Cardinal number |
| 2 | DT | Determiner |
| 3 | EX | Existential there |
| 4 | FW | Foreign word |
| 5 | IN | Preposition/subordinating conjunction |
| 6 | JJ | Adjective |
| 7 | JJR | Adjective, comparative |
| 8 | JJS | Adjective, superlative |
| 9 | LS | List item marker |
