## What is Text Normalization?
Text normalization is a key step in NLP that cleans and preprocesses text into a usable, standard and “less random” format.

It involves techniques such as lowercasing, removing special characters and removing stop words.

### Why do we need text normalization?
Here are the two main reasons why we need text normalization:

1. Reduces complexity:<br><br>
Human language is full of complexities such as slang, abbreviations and different grammatical forms of the same word.<br>
Text normalization helps reduce these complexities by transforming the text into a standard and consistent format.<br>

2. Improves efficiency:<br><br>
By reducing the number of unique forms that a word can take, text normalization improves the efficiency of NLP models.
For instance, a model doesn’t need to learn the difference between “play” and “playing” if both are mapped to the same base form.
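
To make the second point concrete, here is a minimal sketch that counts unique tokens before and after one normalization step. The token list is made up for illustration, and lowercasing stands in for normalization in general.

```python
# Hypothetical toy corpus; the tokens are made up for illustration.
tokens = ["Play", "play", "PLAY", "Game", "game"]

# Vocabulary = set of distinct token forms the model would have to handle.
raw_vocab = set(tokens)
normalized_vocab = {token.lower() for token in tokens}

print(len(raw_vocab))         # 5
print(len(normalized_vocab))  # 2
```

Five surface forms collapse into two, so anything built on top of the vocabulary has fewer entries to deal with.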

### Techniques of text normalization:
The following are some of the main techniques used for text normalization:

#### 1. Lowercasing:
Lowercasing is a technique that transforms all text into lowercase, so that differently cased variants of a word such as “Text” and “TEXT” end up in the same form.

Here’s a simple function that implements lowercasing in Python:

In [1]:
def lowercase_text(text):
  """
  This function takes text and returns the text in lowercase
  """

  return text.lower()

In [2]:
text = "This is a Sample Text to demonstrate lowercasing a piece of TEXT"

lowercase_text(text)

'this is a sample text to demonstrate lowercasing a piece of text'

#### 2. Removing punctuation:
There are cases where we need to get rid of punctuation.

For example, if our word embedding matrix doesn’t support special characters, we need to remove them from the text.

Here’s a short function that implements punctuation removal:

In [5]:
import string
punctuations = list(string.punctuation)

def remove_punctuations(text, punctuations):
    # Delete every occurrence of each punctuation character,
    # then trim leading and trailing whitespace.
    for punctuation in punctuations:
        text = text.replace(punctuation, '')
    return text.strip()

text = "Hello! How are you doing today? I hope everything is going well. Don't forget to bring your umbrella when it rains, and make sure to smile! :)"

text_without_punct = remove_punctuations(text, punctuations)
text_without_punct


'Hello How are you doing today I hope everything is going well Dont forget to bring your umbrella when it rains and make sure to smile'
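
As an alternative, the same result can be obtained in a single pass with `str.translate`. This is only a sketch: the names `PUNCT_TABLE` and `strip_punctuation` are made up here, and like the function above it only covers the ASCII characters in `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation in one pass and trim surrounding whitespace."""
    return text.translate(PUNCT_TABLE).strip()

print(strip_punctuation("Don't forget your umbrella, and smile! :)"))
# Dont forget your umbrella and smile
```

Typographic characters such as “ ” or ’ are not part of `string.punctuation`, so they would survive either version unless they are added explicitly.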

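#### 3. Stemming & lemmatization:
Stemming and lemmatization are techniques that reduce a word to its base form. Stemming strips suffixes using rules, so the result is not always a real word (“lazy” becomes “lazi” in the output below), while lemmatization maps a word to its dictionary form (“are” becomes “be”).

For example, “playing”, “played” and “plays” are all reduced to “play”, which converts all these forms to a single standard form.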
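
Before turning to a real library, the idea of rule-based suffix stripping can be sketched in a few lines. This is a deliberately naive illustration and not the Porter algorithm; the suffix list and the `naive_stem` helper are assumptions made up for this example.

```python
# Naive suffix stripping, for illustration only. This is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["playing", "played", "plays", "play"]])
# ['play', 'play', 'play', 'play']
```

Rules this simple break quickly (`naive_stem("running")` returns `"runn"`), which is why practical stemmers use a much larger set of rules.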

Here’s some Python code that implements stemming:

In [6]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
 
stemmer = PorterStemmer()
 
sentence = "The quick brown foxes are jumping over the lazy dogs"
words = word_tokenize(sentence)
 
for word in words:
    print(word, ": ", stemmer.stem(word))

The :  the
quick :  quick
brown :  brown
foxes :  fox
are :  are
jumping :  jump
over :  over
the :  the
lazy :  lazi
dogs :  dog


Here’s some Python code that implements lemmatization:

In [7]:
import spacy

nlp = spacy.load('en_core_web_sm')

text = "The quick brown foxes are jumping over the lazy dogs"

# Run the spaCy pipeline; every token in the resulting Doc exposes its lemma as `lemma_`.
doc = nlp(text)

for token in doc:
    print(token.text + ": " + token.lemma_)

The: the
quick: quick
brown: brown
foxes: fox
are: be
jumping: jump
over: over
the: the
lazy: lazy
dogs: dog


#### 4. Stop words removal:
For a variety of NLP tasks, words like “are”, “the”, “an” or “on” do not carry any useful information.

Hence, we remove these stop words to improve efficiency and reduce complexity.

Here’s a sample Python function that accomplishes this:

In [8]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_stopwords(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence)

    filtered_sentence = [word for word in word_tokens if word.lower() not in stop_words]

    return ' '.join(filtered_sentence)

remove_stopwords(text_without_punct)

'Hello today hope everything going well Dont forget bring umbrella rains make sure smile'
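
The same filtering logic can also be sketched without NLTK, which makes it easy to see what is going on and to plug in a task-specific list. The stop word set and the `remove_stopwords_simple` name below are assumptions for illustration; real lists are much longer.

```python
# Hypothetical, hand-picked stop word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "are", "is", "on", "to", "and"}

def remove_stopwords_simple(sentence, stop_words=STOP_WORDS):
    """Split on whitespace and drop tokens whose lowercase form is a stop word."""
    kept = [word for word in sentence.split() if word.lower() not in stop_words]
    return " ".join(kept)

print(remove_stopwords_simple("The foxes are jumping on the lazy dogs"))
# foxes jumping lazy dogs
```

Passing the list in as a parameter keeps the choice of stop words separate from the filtering itself, since which words count as uninformative depends on the task.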

#### 5. Expanding contractions:
Contractions are words like “I’m”, “We’re” or “doesn’t”.

These are shortened ways of writing “I am”, “We are” and “does not” respectively.

There are two main reasons why we should expand such contractions:<br><br>

1. A computer doesn’t know that “I’m” and “I am” mean the same thing.<br>
2. Leaving contractions in place increases the dimensionality of the document-term matrix, because “I’m” gets its own column in addition to the columns for “I” and “am”.<br><br>
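
The second point can be illustrated with a small sketch that lists the columns a document-term matrix would need for two equivalent sentences. The documents, the one-entry `EXPANSIONS` table and the `vocabulary` helper are all hypothetical and only serve this example.

```python
# Two hypothetical documents that say the same thing.
docs = ["I'm happy", "I am happy"]

# Hypothetical one-entry lookup table, just enough for this example.
EXPANSIONS = {"i'm": "i am"}

def vocabulary(documents, expand=False):
    """Return the sorted set of whitespace tokens, i.e. the matrix columns."""
    vocab = set()
    for doc in documents:
        for token in doc.lower().split():
            if expand:
                token = EXPANSIONS.get(token, token)
            vocab.update(token.split())
    return sorted(vocab)

print(vocabulary(docs))               # ['am', 'happy', 'i', "i'm"]
print(vocabulary(docs, expand=True))  # ['am', 'happy', 'i']
```

Without expansion the two sentences need four columns and share only one of them; with expansion they need three and become identical rows.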

Here’s a Python function that expands contractions:

In [9]:
import contractions
 
def expand_contractions(text):
    # Expand each whitespace-separated word independently, then rejoin the words.
    expanded_words = [contractions.fix(word) for word in text.split()]
    return ' '.join(expanded_words)

In [10]:
text = "I can't believe it's already Friday! It's been a long week, hasn't it? I'm looking forward to the weekend."

expand_contractions(text)

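'I cannot believe it is already Friday! It is been a long week, has not it? I am looking forward to the weekend.'

Note that the expansion is done word by word, without looking at context: “It's been” comes out as “It is been” rather than “It has been”.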
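
The techniques in this section are usually chained, and the order matters. The sketch below is one possible arrangement using only the standard library: it expands contractions while the apostrophe is still present, then strips punctuation, then drops stop words. The lookup tables and the `normalize` function are hypothetical stand-ins, not a replacement for the libraries used above.

```python
import string

# Hypothetical mini lookup tables; real projects would use fuller resources.
CONTRACTIONS = {"can't": "cannot", "i'm": "i am", "don't": "do not"}
STOP_WORDS = {"a", "an", "the", "to", "is", "are", "i", "am"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, expand contractions, strip punctuation, then drop stop words."""
    tokens = []
    for word in text.lower().split():
        core = word.strip(string.punctuation)      # drop leading/trailing punctuation
        expanded = CONTRACTIONS.get(core, core)    # expand while the apostrophe is intact
        cleaned = expanded.translate(PUNCT_TABLE)  # remove any remaining punctuation
        tokens.extend(t for t in cleaned.split() if t not in STOP_WORDS)
    return " ".join(tokens)

print(normalize("I can't believe I'm going to the GAME!"))
# cannot believe going game
```

If punctuation were removed first, “can't” would turn into “cant” and would no longer match the contraction table.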