# Text Preprocessing

Suppose we have textual data available. We need to apply several pre-processing steps to transform the words into numerical features that work with machine learning algorithms.

The pre-processing steps depend mainly on the domain and the problem itself. We don't need to apply every step to every problem.

Here, we'll look at text preprocessing in Python using the NLTK (Natural Language Toolkit) library.

In [12]:
# import necessary libraries 
import nltk
import string
import re

### Text lowercase

We lowercase the text to reduce the size of the vocabulary of our text data.

In [13]:
def lowercase_text(text): 
    return text.lower() 
  
input_str = "Weather is too Cloudy.Possiblity of Rain is High,Today!!"
lowercase_text(input_str) 

'weather is too cloudy.possiblity of rain is high,today!!'
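
To make the vocabulary claim concrete, here is a minimal sketch that counts distinct whitespace-separated tokens before and after lowercasing. The sample sentence and the helper name `vocab_size` are illustrative assumptions, not part of the original notebook.

```python
def vocab_size(text):
    # Count distinct whitespace-separated tokens
    return len(set(text.split()))

sample = "The weather is cloudy and the Weather is cold"
print(vocab_size(sample))          # 'The'/'the' and 'weather'/'Weather' count separately
print(vocab_size(sample.lower()))  # case variants collapse into one entry
```

On this sample the count drops from 8 to 6 once the case variants are merged.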

### Remove numbers

We should either remove numbers or convert them into their textual representations.
We use regular expressions (the `re` module) to remove the numbers.

In [14]:
# For Removing numbers 
def remove_num(text): 
    result = re.sub(r'\d+', '', text) 
    return result 
  
input_s = "You bought 6 candies from shop, and 4 candies are in home."
remove_num(input_s) 

'You bought  candies from shop, and  candies are in home.'
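
Note that the output above keeps a double space wherever a number was removed. One possible way to tidy this, sketched here with a hypothetical helper `remove_num_clean`, is to collapse runs of whitespace after dropping the digits.

```python
import re

def remove_num_clean(text):
    # Drop digit runs, then collapse the repeated spaces they leave behind
    no_digits = re.sub(r'\d+', '', text)
    return re.sub(r'\s+', ' ', no_digits).strip()

cleaned = remove_num_clean("You bought 6 candies from shop, and 4 candies are in home.")
print(cleaned)  # You bought candies from shop, and candies are in home.
```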

### Convert the numbers into words

In [28]:
import inflect 
q = inflect.engine() 
def convert_num(text): 
    temp_string = text.split() 
    new_str = [] 
    for word in temp_string: 
        if word.isdigit(): 
            temp = q.number_to_words(word) 
            new_str.append(temp) 
        else: 
            new_str.append(word) 
    temp_str = ' '.join(new_str) 
    return temp_str 
  
input1 = 'I am 20 years old'
print(convert_num(input1))
input2 = 'I was born in 2002'
print(convert_num(input2))

I am twenty years old
I was born in two thousand and two


### Remove Punctuation

We remove punctuation so that we don't end up with different forms of the same word. If we don't remove punctuation, then `been`, `been.` and `been!` will be treated as separate words.

In [29]:
# let's remove punctuation 
def rem_punct(text): 
    translator = str.maketrans('', '', string.punctuation) 
    return text.translate(translator) 
  
input_str = "Hey, Are you excited??, After a week, we will be in Shimla!!!"
rem_punct(input_str) 

'Hey Are you excited After a week we will be in Shimla'
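
One caveat: deleting punctuation outright glues together words that had no space between them, so `rem_punct` would turn `Cloudy.Possiblity` from the earlier input into `CloudyPossiblity`. A possible variant, sketched here under the hypothetical name `punct_to_space`, replaces each punctuation character with a space and then collapses the extra whitespace.

```python
import re
import string

def punct_to_space(text):
    # Map every punctuation character to a space, then collapse extra spaces
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return re.sub(r'\s+', ' ', text.translate(table)).strip()

spaced = punct_to_space("Weather is too Cloudy.Possiblity of Rain is High,Today!!")
print(spaced)  # Weather is too Cloudy Possiblity of Rain is High Today
```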

### Remove default stopwords

Stopwords are common words (such as "is" and "the") that contribute little to the meaning of a sentence, so they can usually be removed without changing it much. The NLTK library ships a set of stopwords, which we can use to filter them out of our text and return a list of word tokens.

In [17]:
# importing nltk library
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

nltk.download('stopwords')
nltk.download('punkt')
  
# remove stopwords function 
def rem_stopwords(text): 
    stop_words = set(stopwords.words("english")) 
    word_tokens = word_tokenize(text) 
    filtered_text = [word for word in word_tokens if word not in stop_words] 
    return filtered_text 
  
ex_text = "Data is the new oil. A.I is the last invention"
rem_stopwords(ex_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Data', 'new', 'oil', '.', 'A.I', 'last', 'invention']
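
Two details in the output above are worth noting: the comparison in `rem_stopwords` is case-sensitive (NLTK's English stopwords are lowercase, so a capitalised "The" would survive), and punctuation tokens such as `'.'` are kept. The sketch below shows one way to handle both. It uses a small stand-in stopword set and a hypothetical helper `filter_tokens` so that it runs without any downloads; in the notebook you would pass `set(stopwords.words("english"))` instead.

```python
import string

# Small stand-in set for illustration; not NLTK's actual list
STOP_WORDS = {"is", "the", "a", "of", "in"}

def filter_tokens(tokens, stop_words=STOP_WORDS):
    # Compare lowercased tokens and drop tokens made only of punctuation
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words
        and not all(ch in string.punctuation for ch in tok)
    ]

tokens = ["The", "data", "is", "the", "new", "oil", "."]
print(filter_tokens(tokens))  # ['data', 'new', 'oil']
```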

### Stemming

Stemming is the process of getting the root form of a word. The root, or stem, is the part to which affixes (like -ed, -ize, etc.) are added. We create the stem by removing the prefix or suffix of a word, so stemming may not produce an actual word.

For example: Mangoes ---> Mango, Boys ---> Boy, going ---> go
             
If our sentences are not tokenized, we first need to convert them into tokens. Once the text is split into tokens, we can convert those word tokens into their root form. NLTK provides several stemmers, including the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. The Porter stemmer is the one most commonly used.

In [18]:
#importing nltk's porter stemmer 
from nltk.stem.porter import PorterStemmer 
from nltk.tokenize import word_tokenize 
stem1 = PorterStemmer() 
  
# stem words in the list of tokenised words 
def s_words(text): 
    word_tokens = word_tokenize(text) 
    stems = [stem1.stem(word) for word in word_tokens] 
    return stems 
  
text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'
s_words(text)

['data',
 'is',
 'the',
 'new',
 'revolut',
 'in',
 'the',
 'world',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individu',
 'would',
 'gener',
 'terabyt',
 'of',
 'data',
 '.']
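
To see how the three stemmers mentioned earlier differ, the following sketch runs the same words through each of them. It assumes NLTK is installed; the stemmers themselves need no extra data downloads. The helper name `compare_stemmers` and the word list are illustrative choices.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

def compare_stemmers(word):
    # Return each stemmer's output for the same word
    return {name: s.stem(word) for name, s in stemmers.items()}

for w in ["running", "generate", "revolution"]:
    print(w, compare_stemmers(w))
```

Printing the stems side by side makes it easier to judge which stemmer is too aggressive or too conservative for a given dataset.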

### Lemmatization

Like stemming, lemmatization reduces words to a root form, but it ensures that the root word belongs to the language, so we get valid words. In NLTK, we use WordNetLemmatizer to get the lemmas of words. We also need to provide context for the lemmatization, so we pass pos (part of speech) as a parameter.

In [19]:
from nltk.stem import wordnet 
from nltk.tokenize import word_tokenize 
lemma = wordnet.WordNetLemmatizer()
nltk.download('wordnet')
# lemmatize string 
def lemmatize_word(text): 
    word_tokens = word_tokenize(text) 
    # provide context i.e. part-of-speech(pos)
    lemmas = [lemma.lemmatize(word, pos ='v') for word in word_tokens] 
    return lemmas 
  
text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'
lemmatize_word(text)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Data',
 'be',
 'the',
 'new',
 'revolution',
 'in',
 'the',
 'World',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individual',
 'would',
 'generate',
 'terabytes',
 'of',
 'data',
 '.']

In [20]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

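### Part of speech tagging

Part-of-speech (POS) tagging labels each token with its grammatical category, such as noun, verb, or preposition.

In [21]: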
# importing tokenize library
from nltk.tokenize import word_tokenize 
from nltk import pos_tag 
nltk.download('averaged_perceptron_tagger')
  
# convert text into word_tokens with their tags 
def pos_tagg(text): 
    word_tokens = word_tokenize(text) 
    return pos_tag(word_tokens) 
  
pos_tagg('Are you afraid of something?') 

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('Are', 'NNP'),
 ('you', 'PRP'),
 ('afraid', 'IN'),
 ('of', 'IN'),
 ('something', 'NN'),
 ('?', '.')]

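In the above example, NNP stands for proper noun (singular), PRP for personal pronoun, and IN for preposition or subordinating conjunction. The tagger is statistical and can make mistakes: here 'Are' is tagged NNP and 'afraid' is tagged IN. We can look up the details of any POS tag in the Penn Treebank tagset.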

In [22]:
# downloading the tagset  
nltk.download('tagsets') 
  
# extract information about the tag 
nltk.help.upenn_tagset('PRP')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
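
These Penn Treebank tags can also supply the `pos` argument of the lemmatizer, instead of hard-coding `'v'` for every word as the lemmatization cell did. The sketch below is one common mapping, written as a hypothetical helper `penn_to_wordnet`; it relies on the single-letter codes `'a'`, `'v'`, `'r'`, and `'n'` that WordNet uses for adjective, verb, adverb, and noun.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet part-of-speech letter,
    # defaulting to noun for anything else
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # noun and everything else

print([penn_to_wordnet(t) for t in ['VBZ', 'JJ', 'NN', 'RB', 'IN']])
# ['v', 'a', 'n', 'r', 'n']
```

With this in place, each `(word, tag)` pair returned by `pos_tag` could be lemmatized with `lemma.lemmatize(word, pos=penn_to_wordnet(tag))`.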


### Chunking

Chunking is the process of extracting phrases from unstructured text to give it more structure. It is also called shallow parsing. It is done on top of POS tagging and groups words into chunks, mainly noun phrases. Here we do chunking using regular expressions.

In [23]:
#importing libraries
from nltk.tokenize import word_tokenize  
from nltk import pos_tag 
  
# here we define chunking function with text and regular 
# expressions representing grammar as parameter 
def chunking(text, grammar): 
    word_tokens = word_tokenize(text) 
  
    # label words with pos 
    word_pos = pos_tag(word_tokens) 
  
    # create chunk parser using grammar 
    chunkParser = nltk.RegexpParser(grammar) 
  
    # test it on the list of word tokens with tagged pos 
    tree = chunkParser.parse(word_pos) 
      
    for subtree in tree.subtrees(): 
        print(subtree) 
    #tree.draw() 
      
sentence = 'the little red parrot is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar) 

(S
  (NP the/DT little/JJ red/JJ parrot/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ red/JJ parrot/NN)
(NP the/DT sky/NN)


In the above example, we defined the grammar using a regular expression rule. This rule says that an NP (noun phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).

Image after running above code.
<img src=".\Images\11.png">

Libraries like spaCy and TextBlob also provide chunking out of the box.
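
For downstream use, it can be handier to collect the chunks as plain strings than to print subtrees. Here is a minimal sketch using NLTK's `RegexpParser` with the same NP grammar. It runs on hand-tagged tokens, so no tagger download is needed; the helper name `noun_phrases` is an assumption for illustration.

```python
import nltk

def noun_phrases(tagged_tokens, grammar="NP: {<DT>?<JJ>*<NN>}"):
    # Parse (word, tag) pairs and join the words of every NP chunk
    parser = nltk.RegexpParser(grammar)
    tree = parser.parse(tagged_tokens)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]

tagged = [("the", "DT"), ("little", "JJ"), ("red", "JJ"), ("parrot", "NN"),
          ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
          ("the", "DT"), ("sky", "NN")]
print(noun_phrases(tagged))  # ['the little red parrot', 'the sky']
```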

### Named Entity Recognition

Named entity recognition (NER) is used to extract information from unstructured text. It classifies the entities present in the text into categories such as person, organization, event, and place. This gives you detailed knowledge about the text and the relationships between the different entities.

In [24]:
#Importing tokenization and chunk
from nltk.tokenize import word_tokenize 
from nltk import pos_tag, ne_chunk 
nltk.download('maxent_ne_chunker')
nltk.download('words')
  
def ner(text): 
    # tokenize the text 
    word_tokens = word_tokenize(text) 
  
    # pos tagging of words 
    word_pos = pos_tag(word_tokens) 
  
    # tree of word entities 
    print(ne_chunk(word_pos)) 
  
text = 'Brain Lara scored the highest 400 runs in a test match which played in between WI and England.'
ner(text) 

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
(S
  (PERSON Brain/NNP)
  (PERSON Lara/NNP)
  scored/VBD
  the/DT
  highest/JJS
  400/CD
  runs/NNS
  in/IN
  a/DT
  test/NN
  match/NN
  which/WDT
  played/VBD
  in/IN
  between/IN
  (ORGANIZATION WI/NNP)
  and/CC
  (GPE England/NNP)
  ./.)
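
The tree printed above can be turned into a simple list of entities by walking its labelled children. The sketch below uses a hypothetical helper `extract_entities` and a hand-built `Tree` that mirrors part of the output, so it runs without downloading the chunker model; in the notebook you would pass the result of `ne_chunk(word_pos)` instead.

```python
from nltk import Tree

def extract_entities(tree):
    # Collect (label, text) pairs from the labelled subtrees under the root
    entities = []
    for node in tree:
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

# Hand-built tree for illustration only
demo = Tree('S', [
    Tree('PERSON', [('Lara', 'NNP')]),
    ('scored', 'VBD'),
    Tree('GPE', [('England', 'NNP')]),
])
print(extract_entities(demo))  # [('PERSON', 'Lara'), ('GPE', 'England')]
```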
