
In [1]:
import nltk
import string
import re

**Text Lowercase**

In [2]:
def text_lowercase(text):
  return text.lower()

In [3]:
input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
print(text_lowercase(input_str))

hey, did you know that the summer break is coming? amazing right !! it's only 5 more days !!
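
For text that may contain non-ASCII letters, Python's built-in `str.casefold()` is a more aggressive alternative to `lower()`. The following is a minimal sketch; the helper name `text_casefold` is an assumption and is not part of the original notebook.

```python
def text_casefold(text):
  # casefold() applies Unicode case folding, which is stricter than lower()
  return text.casefold()

# lower() keeps the German sharp s, while casefold() maps it to "ss"
print("Straße".lower())
print(text_casefold("Straße"))
```

Case folding is mainly useful when strings are compared or deduplicated; for plain English text the two methods give the same result.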


**Remove numbers**

In [4]:
def remove_numbers(text):
  result = re.sub(r'\d+', '', text)
  return result

  

In [5]:
input_str = "There are 3 balls in this bag, and 12 in the other one."
print(remove_numbers(input_str))

There are  balls in this bag, and  in the other one.
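
Deleting digits leaves behind the spaces that surrounded them, which is why the output above contains double spaces. A minimal sketch of one way to tidy this up, assuming a hypothetical helper named `remove_numbers_clean` (not in the original notebook):

```python
import re

def remove_numbers_clean(text):
  # drop every run of digits
  without_digits = re.sub(r'\d+', '', text)
  # split() with no arguments splits on any whitespace run, so joining
  # the pieces with one space collapses the gaps left behind
  return ' '.join(without_digits.split())

print(remove_numbers_clean("There are 3 balls in this bag, and 12 in the other one."))
```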


Instead of deleting numbers, we can convert them into words using the inflect library.

In [6]:
import inflect

p = inflect.engine()

In [9]:
def convert_number(text):
  # split the string into tokens on single spaces
  temp_str = text.split(' ')

  # initialize an empty list for the converted tokens
  new_string = []

  for word in temp_str:
    # if the token consists only of digits, convert it to words
    # and append the result to the list
    if word.isdigit():
      temp = p.number_to_words(word)
      new_string.append(temp)

    # otherwise append the token unchanged
    else:
      new_string.append(word)

  # join the tokens back into a single string
  temp_str = ' '.join(new_string)
  return temp_str
     

In [12]:
input_str = 'There are 31 balls in this bag, and 12 in the other one.'
convert_number(input_str)

'There are thirty-one balls in this bag, and twelve in the other one.'
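
`convert_number` only converts tokens that consist entirely of digits, so a number directly followed by punctuation (such as `12,`) is left unchanged because `'12,'.isdigit()` is false. One possible workaround, sketched below, is to match digit runs with a regular expression and pass each match to a converter. The name `convert_number_regex` and the `to_words` parameter are assumptions for illustration; in this notebook `p.number_to_words` could be passed as `to_words`. The demo uses a tiny stand-in lookup so that the snippet runs without inflect.

```python
import re

def convert_number_regex(text, to_words):
  # replace each run of digits with whatever to_words returns for it
  return re.sub(r'\d+', lambda match: to_words(match.group()), text)

# hypothetical stand-in for p.number_to_words, covering only the demo values
DEMO_WORDS = {'12': 'twelve', '31': 'thirty-one'}

def demo_to_words(digits):
  # fall back to the original digits when the value is not in the lookup
  return DEMO_WORDS.get(digits, digits)

print(convert_number_regex('I counted 31, then 12.', demo_to_words))
```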

**Remove punctuation**
We remove punctuation so that we don’t end up with different forms of the same word. If we don’t remove it, then "been.", "been," and "been!" will be treated as separate tokens.

In [13]:
def remove_punctuation(text):
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

In [14]:
input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
remove_punctuation(input_str)

'Hey did you know that the summer break is coming Amazing right  Its only 5 more days '
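
To illustrate the point about word forms, the short sketch below strips punctuation from three variants of the same word and checks that they collapse into a single token. It repeats the `remove_punctuation` definition so that it can run on its own; the sample list is made up for illustration.

```python
import string

def remove_punctuation(text):
  # map every ASCII punctuation character to None, i.e. delete it
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

variants = ['been.', 'been,', 'been!']
# a set keeps only the distinct cleaned forms
cleaned = {remove_punctuation(word) for word in variants}
print(cleaned)
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as typographic quotes would need to be handled separately.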

**Remove whitespace**

In [15]:
def remove_whitespace(text):
  return " ".join(text.split())

In [27]:
input_str = "   we don't need   the given questions"
print(len(input_str))
temp = input_str.split()
temp
remove_whitespace(input_str)


38


"we don't need the given questions"

In [24]:
# avoid naming the variable "str", which would shadow the built-in type
clean_str = "we don't need the given questions"
print(len(clean_str))

33
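
Because `split()` without arguments splits on any run of whitespace, the same approach also normalizes tabs and newlines, not just spaces. A small sketch (the sample string is an assumption):

```python
def remove_whitespace(text):
  # split on any whitespace run, then rejoin with single spaces
  return " ".join(text.split())

messy = "\twe don't   need\nthe given  questions \n"
print(repr(remove_whitespace(messy)))
```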


**Remove default stop words**

Stopwords are very common words (such as "the", "is", and "and") that usually contribute little to the meaning of a sentence, so they can often be removed without changing that meaning much. The NLTK library provides a set of stopwords, which we can use to remove stopwords from our text and return a list of word tokens.

In [34]:
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [35]:
def remove_stopwords(text):
  stop_words = set(stopwords.words("english"))
  word_tokens = word_tokenize(text)
  filtered_text = [word for word in word_tokens if word not in stop_words]
  return filtered_text
  

In [36]:
example_text = "This is a sample sentence and we are going to remove the stopwords from this."
remove_stopwords(example_text)

['This', 'sample', 'sentence', 'going', 'remove', 'stopwords', '.']
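
Note that 'This' survives in the output above: the comparison is case-sensitive, and the stop word list is lowercase. A minimal sketch of a case-insensitive filter follows. To stay self-contained it uses a small hand-written stop word set and a plain `split()`; both are stand-ins (assumptions) for `stopwords.words("english")` and `word_tokenize` as used in the notebook.

```python
# hypothetical miniature stop word set; the notebook uses NLTK's English list
DEMO_STOP_WORDS = {'this', 'is', 'a', 'and', 'we', 'are', 'to', 'the', 'from'}

def remove_stopwords_ci(tokens, stop_words=DEMO_STOP_WORDS):
  # compare the lowercased token, but keep the original spelling in the result
  return [token for token in tokens if token.lower() not in stop_words]

tokens = "This is a sample sentence and we are going to remove the stopwords from this".split()
print(remove_stopwords_ci(tokens))
```

An alternative with the same effect is to lowercase the whole text before tokenizing it.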

**Stemming**

Stemming is the process of reducing a word to its stem or root form. The stem is the part to which affixes (-ed, -ize, de-, -s, etc.) are attached. A stemmer produces it by stripping affixes, typically suffixes, with heuristic rules, so the result may not be an actual word.

Example:

books      --->    book
looked     --->    look
denied     --->    deni
flies      --->    fli

If the text has not been tokenized yet, we first need to split it into tokens. Once we have word tokens, we can reduce each of them to its root form. Three stemming algorithms are commonly used: the Porter Stemmer, the Snowball Stemmer, and the Lancaster Stemmer. The Porter Stemmer is the most common among them.

In [37]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()

In [42]:
def stem_words(text):
  word_tokens = word_tokenize(text)
  stems = [stemmer.stem(word) for word in word_tokens]
  return stems

In [43]:
text = 'data science uses scientific methods algorithms and many types of processes'
stem_words(text)

['data',
 'scienc',
 'use',
 'scientif',
 'method',
 'algorithm',
 'and',
 'mani',
 'type',
 'of',
 'process']

**Lemmatization**

Like stemming, lemmatization reduces a word to a base form. The difference is that lemmatization looks the word up in a vocabulary, so the result (the lemma) is a valid word of the language. In NLTK, we use the WordNetLemmatizer to get the lemmas of words. The lemmatizer also needs context in the form of a part-of-speech tag, so we pass it as the pos parameter. The code below uses pos='v' (verb), which is why plural nouns such as 'methods' and 'algorithms' are left unchanged in the output.

In [47]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [48]:
def lemmatize_word(text):
  word_tokens = word_tokenize(text)
  # provide context, i.e. the part-of-speech ('v' = verb)
  lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
  return lemmas

In [49]:
text = 'data science uses scientific methods algorithms and many types of processes'
lemmatize_word(text)

['data',
 'science',
 'use',
 'scientific',
 'methods',
 'algorithms',
 'and',
 'many',
 'type',
 'of',
 'process']
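
The string-level steps from the first half of this notebook (lowercasing, number removal, punctuation removal, and whitespace cleanup) can be chained into one function. The sketch below is one possible ordering, using only the standard library; the name `preprocess` is an assumption and is not defined in the original notebook.

```python
import re
import string

def preprocess(text):
  # 1. lowercase
  text = text.lower()
  # 2. remove runs of digits
  text = re.sub(r'\d+', '', text)
  # 3. remove ASCII punctuation
  text = text.translate(str.maketrans('', '', string.punctuation))
  # 4. collapse whitespace left over from the previous steps
  return ' '.join(text.split())

print(preprocess("Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"))
```

Whitespace cleanup runs last because both number removal and punctuation removal can leave extra spaces. The token-level steps (stop word removal, stemming, lemmatization) would then operate on the cleaned string.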