# <center> <font size = 24 color = 'steelblue'> <b>Text Pre-Processing

<div class="alert alert-block alert-info">
    
<font size = 4> 

**By the end of this notebook you will be able to:**
- Understand steps involved in text preprocessing
- Implement text preprocessing using Python

# <a id= 't0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#t1)<br>
[2. Download the necessary corpus from NLTK](#t2)<br>
[3. Data cleaning steps](#t3)<br>
> [3.1 Tokenization](#t3.1)<br>
> [3.2 Changing case](#t3.2)<br>
> [3.3 Spelling correction](#t3.3)<br>
> [3.4 POS Tagging](#t3.4)<br>
> [3.5 Named entity recognition (NER)](#t3.5)<br>
> [3.6 Stemming and Lemmatization](#t3.6)<br>
>> [a. Stemming](#3a)<br>
>> [b. Lemmatization](#3b)<br>

> [3.7 Noise entity removal](#t3.7)<br>
>> [a. Remove stopwords](#a)<br>
>> [b. Remove urls](#b)<br>
>> [c. Remove punctuations](#c)<br>
>> [d. Remove emoticons](#d)<br>

##### <a id = 't1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [None]:
!pip install nltk
!pip install spacy
# re and string ship with Python's standard library, so they need no pip install
!python -m spacy download en_core_web_sm
!pip install svgling

In [None]:
import nltk
import spacy
import re
from string import punctuation

[top](#t0)

##### <a id = 't2'>
<font size = 10 color = 'midnightblue'> <b>Download necessary corpus and models from nltk

In [None]:
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
nltk.download('maxent_ne_chunker')
nltk.download('stopwords')
nltk.download('wordnet')

<div class="alert alert-block alert-info">
    
<font size = 4> 

**Note:**
    
- A `LookupError` is raised whenever a corpus or model that another function depends on is missing.
- Use `nltk.download(<name of the corpus/model>)` to download the missing resource.



<font size = 6 color = seagreen> <b> Import the necessary corpus

In [None]:
from nltk.corpus import wordnet
from nltk.corpus import stopwords

##### <a id = 't3'>
<font size = 10 color = 'midnightblue'> <b> Data cleaning steps

<font size = 6 color = seagreen> <b> <center> Let's start by defining a custom text for preprocessing.<br>
<font size = 6 color = seagreen> <center>This text contains emoticons, punctuation, URLs, etc.

In [None]:
text = """Embracing life's challenges is like navigating a journey. 🚀
Stay motivated, overcome hurdles, and explore new paths to success!
Check out inspiring stories at https://motivationalhub.com for an extra boost!"""
print(text)

[top](#t0)

<a id = 't3.1'>
<font size = 6 color = pwdrblue>  <b>Tokenization 

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Tokenization is the process of breaking text into individual words or tokens.
- This can be achieved with `word_tokenize` or with the simple `split` method of the string class.
- For paragraphs or larger documents, sentence tokenization can also be used.
- Sentence tokenization is a crucial step in natural language processing (NLP) and text analysis, as it allows algorithms to work with smaller units of meaning.  

<font size = 5 color = seagreen>  <b>Sentence tokenization

In [None]:
sentences = nltk.sent_tokenize(text)
for i, sentence in enumerate(sentences):
    print(f"{i}:  {sentence}")

<font size = 5 color = seagreen>  <b>Word tokenization

In [None]:
word_tokens = nltk.word_tokenize(text)
print(word_tokens)

<div class="alert alert-block alert-success">
<font size = 4> 

**However, if the text contains emoticons or URLs, word tokenization may split them, complicating the text cleaning process. Hence, a simple text split function could be more helpful in this context.**


In [None]:
word_tokens = text.split()
print(word_tokens)

<div class="alert alert-block alert-success">
<font size = 4> 

- <b>This also creates word tokens but keeps emoticons, URLs, user handles, hashtags, etc. together for further analysis.
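
As a middle ground between `word_tokenize` and `split()`, a regular expression can keep URLs whole while still separating punctuation from words. The block below is only a sketch: the pattern and the name `simple_tokenize` are assumptions made for this illustration and are not part of NLTK.

```python
import re

# Hypothetical pattern for illustration: try URLs first, then words
# (allowing one internal apostrophe), then any single non-space character.
TOKEN_PATTERN = re.compile(r"https?://\S+|\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into tokens, keeping URLs whole and punctuation separate."""
    return TOKEN_PATTERN.findall(text)

sample = "Stay motivated! 🚀 Read more at https://motivationalhub.com today."
tokens = simple_tokenize(sample)
print(tokens)
```

One limitation of this sketch: punctuation that directly follows a URL, with no space in between, is absorbed into the URL token.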


[top](#t0)

<a id = 't3.2'>
<font size = 6 color = pwdrblue>  <b>Changing the case.

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Changing the case is a text normalization step.
- It gives words a uniform representation and reduces the vocabulary size.
- Uniform casing also eases text matching, search and retrieval; for entity recognition, keep in mind that capitalization is often a cue for names.
- Changing the casing of the data reduces redundancy and helps the ML model generalize better.

In [None]:
words_lower_case = text.lower().split()
print(words_lower_case)
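
The effect on vocabulary size can be checked by counting distinct tokens before and after lowercasing. This is a minimal sketch on a made-up sentence; the variable names are chosen for this illustration only.

```python
# Count distinct tokens before and after lowercasing a small sample.
sample = "The launch was a success and the Launch team celebrated THE success"

raw_vocab = set(sample.split())
lower_vocab = set(sample.lower().split())

print(len(raw_vocab), sorted(raw_vocab))
print(len(lower_vocab), sorted(lower_vocab))
```

Here "The", "the" and "THE" collapse into one entry, as do "launch" and "Launch", so the vocabulary shrinks from 11 distinct tokens to 8.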

[top](#t0)

<a id = 't3.3'>
<font size = 6 color = pwdrblue>  <b>Spelling correction

<div class="alert alert-block alert-success">
<font size = 4> 

- Spelling correction improves text quality and reduces the risk of misreading the text.
- Correctly spelled words are more likely to be found in the vocabulary of language models and embeddings.
- It reduces ambiguity and the number of out-of-vocabulary tokens.

**For spelling correction we are using:**
 - `nltk.edit_distance` to measure the distance between each word in the text and the words in the vocabulary available in nltk.
 - `edit_distance` calculates the Levenshtein edit distance between two strings, i.e. the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other.

In [None]:
# Tokenize the text
word_tokens = text.lower().split()

In [None]:
# Get list of English words
words = nltk.corpus.words.words()

In [None]:
print("Total number of words in the vocabulary : ", len(words))

In [None]:
# Correct spelling of each word
corrected_tokens = []
for token in word_tokens:
    # Find the word with the lowest distance and replace it
    corrected_token = min(words, key=lambda x: nltk.edit_distance(x, token))
    corrected_tokens.append(corrected_token)
print("Corrected tokens:", corrected_tokens)
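
To make the distance measure concrete, here is a small pure-Python sketch of the Levenshtein distance with unit cost for every operation. It is an illustration of the idea, not NLTK's implementation; the function name `levenshtein` and the four-word `vocabulary` are assumptions standing in for `nltk.edit_distance` and `nltk.corpus.words`.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] = distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

# Hypothetical mini-vocabulary standing in for nltk.corpus.words
vocabulary = ["journey", "success", "motivated", "hurdles"]
misspelled = "jurney"

best = min(vocabulary, key=lambda word: levenshtein(word, misspelled))
print(best, levenshtein(best, misspelled))
```

Because the cell above compares every token with every word in the vocabulary, it can take a long time on the full word list. Skipping tokens that are already in the vocabulary (ideally stored as a `set`) is one way to reduce that work.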

[top](#t0)

<a id = 't3.4'>
<font size = 6 color = pwdrblue>  <b>POS Tagging

<div class="alert alert-block alert-success">
<font size = 4> 
    
Part-of-Speech (PoS) tagging assigns each word in a text a part-of-speech label (noun, verb, adjective, etc.) based on its definition and the context in which it is used.

In [None]:
# Tokenize the text
word_tokens = text.split()

In [None]:
# Part-of-speech tagging can be done using pos_tag function of nltk.
tagged = nltk.pos_tag(word_tokens)

In [None]:
print(tagged)

[top](#t0)

<a id = 't3.5'>
<font size = 6 color = pwdrblue>  <b>Named entity recognition (NER)

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Named entity recognition (NER) is a natural language processing (NLP) technique that involves identifying and classifying entities (objects, places, people, organizations, dates, monetary values, percentages, etc.) in text.
- Named entities can belong to various categories, such as:

|**Entity Object**| **Meaning** |
|-|-|
|Person |Individual names of people.|
|Location| Places, cities, countries, etc.|
|Organization | Names of companies, institutions, etc.|
|Date | Temporal expressions like dates and times.|
|Money| Currency amounts.|
|Percent| Percentage values.|

<font size = 5 color = seagreen> <b><center> Let's consider a different example text to understand named entity recognition

In [None]:
text_example = "In 2019, Apple Inc. announced the launch of the iPhone 11 at their headquarters in Cupertino, California, with Tim Cook, the CEO, presenting the new features."
print(text_example)

In [None]:
# tokenize the text
word_tokens = text_example.split()

In [None]:
# get the pos tags
tagged = nltk.pos_tag(word_tokens)
print(tagged)

In [None]:
named_entities = nltk.ne_chunk(tagged)
print(named_entities)

<font size = 5 color = seagreen> <b>Named entity recognition can also be implemented using the spaCy package

In [None]:
# Load the pre-trained English language model
nlp = spacy.load("en_core_web_sm")

In [None]:
# Run the pipeline on the text to get a spaCy Doc object
doc = nlp(text_example)

In [None]:
# Extract named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

In [None]:
# Print the named entities
print(entities)

[top](#t0)

<a id = 't3.6'>
<font size = 6 color = pwdrblue>  <b>Stemming and Lemmatization

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Stemming and lemmatization are techniques used in NLP and text mining to reduce words to their base or root forms, simplifying the process of analysis and text understanding.

In [None]:
text = """Embracing life's challenges is like navigating a journey. 🚀
Stay motivated, overcome hurdles, and explore new paths to success!
Check out inspiring stories at https://motivationalhub.com for an extra boost!"""

<font size = 5 color = seagreen> <b>Let's start by tokenising the text

In [None]:
word_tokens = text.lower().split()

<a id = '3a'>
<font size = 5 color = seagreen> <b> Stemming

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Stemming is the process of removing suffixes or prefixes from words to obtain their root or base form, known as the stem. The goal is to reduce words to a common form, even if it is not a valid word.
- The Porter stemmer is one of the most widely used stemming algorithms.

In [None]:
# create stemmer object
stemmer = nltk.stem.PorterStemmer()

In [None]:
# stem each token
stemmed_tokens = [stemmer.stem(token) for token in word_tokens]

In [None]:
print("Stemmed tokens:", stemmed_tokens)
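
The idea of suffix stripping can be illustrated with a deliberately naive stemmer. This sketch is not the Porter algorithm, which applies a much richer, ordered rule set; the suffix list, the `min_stem_len` parameter and the name `naive_stem` are assumptions made for this example.

```python
# Hypothetical suffix list, checked in this order; NOT the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["navigating", "motivated", "hurdles", "paths", "is"]:
    print(word, "->", naive_stem(word))
```

The results ("navigat", "motivat", "hurdl") show why stems are not always valid words, as described above.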

<a id = '3b'>
<font size = 5 color = seagreen> <b> Lemmatization

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma.
- Lemmatization considers the context and meaning of a word and produces valid words.
- NLTK provides a WordNet-based lemmatizer. When no PoS tag is passed, it treats every word as a noun.

In [None]:
# Create lemmatizer object
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
# Lemmatize each token
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in word_tokens]

In [None]:
print("Lemmatized tokens:", lemmatized_tokens)

<div class="alert alert-block alert-success">
<font size = 4> 
    
**Using PoS tagging in lemmatization**
  - For PoS tag based lemmatization, we pass the PoS tag of each word in the sentence to the lemmatizer.
  - To achieve this, we first need to map the Penn Treebank PoS tags to WordNet PoS tags.
  - The function below performs this mapping:

In [None]:
# pos tag mapping
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [None]:
# Get the pos tag
tagged = nltk.pos_tag(word_tokens)

In [None]:
# Get the root word for each of the tokens using their corresponding pos-tags
lemma_sent = []
for word, tag in tagged:
    new_tag = pos_tagger(tag)
    lemma = lemmatizer.lemmatize(word, new_tag)
    lemma_sent.append(lemma)

In [None]:
print(f"Original sentence : \n{text}")

In [None]:
print(f"Lemmatized sentence : \n{' '.join(lemma_sent)}")

[top](#t0)

<a id = 't3.7'>
<font size = 6 color = pwdrblue>  <b>Noise entity removal

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Noise entity removal in NLP involves the identification and removal of irrelevant or undesired entities from a given text.
- Noise entities can be entities that are not relevant to the analysis or entities that add unnecessary complexity to the task at hand.

In [None]:
text = """Embracing life's challenges is like navigating a journey. 🚀
Stay motivated, overcome hurdles, and explore new paths to success!
Check out inspiring stories at https://motivationalhub.com for an extra boost!"""

In [None]:
# Tokenization
word_tokens = text.lower().split()

In [None]:
# PoS tagging
tagged = nltk.pos_tag(word_tokens)

In [None]:
# Lemmatization
lemma_sent = []
for word, tag in tagged:
    new_tag = pos_tagger(tag)
    lemma = lemmatizer.lemmatize(word, new_tag)
    lemma_sent.append(lemma)

<a id = 'a'>
<font size = 5 color = seagreen> <b> a. Remove stopwords

<div class="alert alert-block alert-success">
<font size = 4> 

- Identify and remove common stopwords (e.g., "is," "the," "and") that do not carry much semantic meaning.
- This can help in focusing on more meaningful entities.

In [None]:
# Obtain the list of stopwords from the corpus
stp_wrds_eng = stopwords.words('english')
print(stp_wrds_eng)

In [None]:
# Removing stopwords
text_clean = [w for w in lemma_sent if w not in stp_wrds_eng]
print(f"Lemmatized : \n{' '.join(lemma_sent)}")
print(f"Cleaned  : \n{' '.join(text_clean)}")
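
Because the tokens come from `split()`, some still carry punctuation (for example `"is,"`), and such tokens do not match the stopword list. The sketch below strips surrounding punctuation before the comparison and uses a `set` for fast membership checks. The small `STOPWORDS` set and the name `remove_stopwords` are assumptions for this illustration; the notebook itself uses `stopwords.words('english')`.

```python
from string import punctuation

# Hypothetical mini stopword set; the notebook uses stopwords.words('english').
STOPWORDS = {"is", "a", "and", "to", "at", "for", "an", "the"}

def remove_stopwords(tokens, stopword_set=STOPWORDS):
    """Drop tokens that are stopwords once surrounding punctuation is ignored."""
    kept = []
    for token in tokens:
        core = token.strip(punctuation).lower()
        if core not in stopword_set:
            kept.append(token)  # keep the original token, punctuation included
    return kept

tokens = "Success is, after all, a journey and not the end.".split()
print(remove_stopwords(tokens))
```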

<a id = 'b'>
<font size = 5 color = seagreen> <b> b. Remove urls

<div class="alert alert-block alert-success">
<font size = 4> 
    
- URLs are not essential for many analyses, so they are usually removed.
- We can use regex to identify and remove the URLs.

In [None]:
# Identify URLs with the pattern r'http\S+' (matches both http and https) and replace them with an empty string
text_clean = re.sub(r'http\S+', '', ' '.join(text_clean), flags=re.MULTILINE)
print(text_clean)
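
The pattern above misses links written without a scheme, such as `www.example.org`. A slightly broader variant is sketched below; the pattern, the example address and the name `remove_urls` are assumptions for illustration, and real-world URL matching usually needs more care.

```python
import re

# Hypothetical pattern: matches http://, https:// and bare www. links.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

def remove_urls(text):
    """Delete URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    # Note: this also turns newlines into single spaces.
    return re.sub(r"\s+", " ", without_urls).strip()

sample = "Check https://motivationalhub.com or www.example.org for a boost!"
print(remove_urls(sample))
```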

<a id = 'c'>
<font size = 5 color = seagreen> <b> c. Remove punctuations

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Punctuation is not always useful for analysis, so it is often removed as well.
- The list of punctuation characters can be obtained from the `string` module.

In [None]:
# text_clean is a string at this point, so this iterates over its characters
text_clean = [ch for ch in text_clean if ch not in punctuation]
print(f"Cleaned  : \n{''.join(text_clean)}")

<div class="alert alert-block alert-info">
<font size = 4> 

**Note:**
- In sentiment analysis, punctuation marks such as ! or ? may be significant.
- Text cleaning steps should be customised based on the analysis objective.
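
One way to make this step configurable is to remove punctuation with `str.translate` while exempting selected characters. This is a sketch under the assumption that only ASCII punctuation from `string.punctuation` matters; the function name `strip_punctuation` and its `keep` parameter are introduced here for illustration.

```python
from string import punctuation

def strip_punctuation(text, keep=""):
    """Remove ASCII punctuation from text, except characters listed in keep."""
    to_remove = "".join(ch for ch in punctuation if ch not in keep)
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans("", "", to_remove))

sample = "Stay motivated, overcome hurdles, and explore new paths to success!"
print(strip_punctuation(sample))
print(strip_punctuation(sample, keep="!?"))
```

The second call keeps the final exclamation mark, which may be useful when the downstream task is sentiment analysis.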



<a id = 'd'>
<font size = 5 color = seagreen> <b>d. Remove emoticons

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Much of the text on social media contains emoticons.
- Handling emoticons is therefore a necessary part of an NLP pipeline.
- They may be directly removed for simplicity.
- This can be achieved using regex by specifying the Unicode range of these emoticons, as given below:

In [None]:
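# Matches every character outside the Basic Multilingual Plane (BMP), which covers most emoji
# (but also other supplementary-plane symbols, and not emoji encoded in the BMP such as ☺)
RE_EMOJI = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)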
def strip_emoji(text):
    return RE_EMOJI.sub(r'', text)

In [None]:
# Use function to remove emoticons from text
text_clean = strip_emoji(''.join(text_clean))
print(text_clean)

<div class="alert alert-block alert-info">
<font size = 4> 

**Note :**
 - Emoticons may instead be replaced with their intended meaning in text form.
 - For example: 😀 translates to happy face.
 - This process is used in the VADER sentiment package as a data cleaning step in sentiment analysis.
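
A rough way to try this replacement without extra packages is to substitute each matched character with its official Unicode name from the standard `unicodedata` module. This is only a sketch and is not how VADER maps emoji to text: Unicode names are descriptive ("grinning face") rather than sentiment labels, and the names `RE_SUPPLEMENTARY` and `describe_emoji` are assumptions for this example.

```python
import re
import unicodedata

# Same broad range as RE_EMOJI above: any character outside the BMP.
RE_SUPPLEMENTARY = re.compile("[\U00010000-\U0010ffff]")

def describe_emoji(text):
    """Replace each matched character with its lowercase Unicode name."""
    def to_name(match):
        # Fall back to removing the character when it has no Unicode name.
        name = unicodedata.name(match.group(0), "")
        return f" {name.lower()} " if name else ""
    replaced = RE_SUPPLEMENTARY.sub(to_name, text)
    # Normalize the extra spaces introduced around the inserted names.
    return " ".join(replaced.split())

print(describe_emoji("navigating a journey. 🚀"))
print(describe_emoji("great day 😀"))
```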



[top](#t0)