# Natural Language Processing

## Text Preprocessing Techniques


Text preprocessing is the initial step in Natural Language Processing (NLP) pipelines, essential for transforming raw text data into a format suitable for analysis and modeling. This process involves cleaning and standardizing the text to remove noise, irregularities, and inconsistencies that may impede accurate interpretation by NLP algorithms. Text preprocessing encompasses several tasks aimed at enhancing the quality and usability of textual data.

###### **Core Preprocessing Steps:**

1. Noise Removal
2. Normalization
3. Tokenization
4. Stopword Removal
5. Stemming and Lemmatization

**Additional Preprocessing Techniques:**

Text preprocessing techniques go beyond the core steps and can be applied based on specific needs and domains. These additional techniques include:

1. Spell Checking and Correction
2. Entity Recognition and Replacement
3. Handling Rare or Out-of-Vocabulary (OOV) Words
4. Handling Emojis in text

By combining both core preprocessing steps and additional techniques, practitioners can tailor their preprocessing approach to suit the requirements of the task at hand and achieve optimal results in NLP applications.
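Of the additional techniques listed above, handling out-of-vocabulary (OOV) words is simple enough to sketch right away. The example below is a minimal illustration, assuming a frequency threshold decides what counts as "rare"; the names `build_vocab` and `replace_oov`, the `min_count` value, and the `<UNK>` placeholder are all illustrative choices rather than part of any library.

```python
from collections import Counter

def build_vocab(token_lists, min_count=2):
    # Count every token across the corpus and keep only the frequent ones
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, n in counts.items() if n >= min_count}

def replace_oov(tokens, vocab, unk="<UNK>"):
    # Map any token outside the vocabulary to a single placeholder
    return [tok if tok in vocab else unk for tok in tokens]

# Toy corpus (hypothetical data for illustration)
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
vocab = build_vocab(corpus)
print(replace_oov(["the", "cat", "jumped"], vocab))  # prints ['the', 'cat', '<UNK>']
```

Collapsing rare words into one placeholder keeps the vocabulary small; subword tokenization is a common alternative when that loses too much information.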

### **Noise Removal:**

Noise in text data refers to irrelevant or distracting elements that do not contribute to the meaning or analysis of the text. Common types of noise in text data include:
1. HTML tags
2. Special characters
3. Numerical Digits
4. Whitespace
5. URLs and email addresses

To remove noise from text data, various techniques can be employed, including:
1. Regular Expressions
2. String Manipulation
3. Specialized libraries

In [2]:
# With Regular Expressions

import re

def remove_noise(text):
    # Remove HTML tags
    clean_text = re.sub(r'<.*?>', '', text)
    
    # Remove URLs and email addresses while their punctuation is still intact
    clean_text = re.sub(r'\b(?:https?|ftp|mailto)://\S+|www\.\S+|[\w\.-]+@[\w\.-]+\.\w+\b', '', clean_text)
    
    # Remove special characters (digits are kept, because \w matches them)
    clean_text = re.sub(r'[^\w\s]', '', clean_text)
    
    # Remove extra whitespace
    clean_text = re.sub(r'\s+', ' ', clean_text).strip()
    
    return clean_text

# Example text with noise
text_with_noise = "<p>Hello, world! This is an example text with special characters: $%^& and numbers: 123.</p>"
clean_text = remove_noise(text_with_noise)
print("Cleaned Text:", clean_text)

Cleaned Text: Hello world This is an example text with special characters and numbers 123
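Plain string methods, the second technique listed above, can cover simpler cases without regular expressions. The following is a minimal sketch, assuming only ASCII punctuation needs to be stripped; the function name `strip_punctuation` is illustrative.

```python
import string

def strip_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    no_punct = text.translate(table)
    # split() with no arguments splits on any run of whitespace, so joining
    # the pieces with a single space also collapses extra whitespace
    return ' '.join(no_punct.split())

print(strip_punctuation("Hello,   world! Special characters: $%^& and numbers: 123."))
# prints: Hello world Special characters and numbers 123
```

This approach does not understand HTML tags or URLs, so it suits text that has already been stripped of markup.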


There are various options available for removing noise using specialized libraries. One such option is BeautifulSoup. For example, we can utilize BeautifulSoup to clean HTML tags as follows:

In [3]:
from bs4 import BeautifulSoup

def remove_html_tags(text):
    # Parse HTML content
    soup = BeautifulSoup(text, "html.parser")
    # Extract text content
    clean_text = soup.get_text()
    return clean_text

# Example text with HTML tags
html_text = "<p>Hello, <b>world</b>!</p>"

# Remove HTML tags
clean_text = remove_html_tags(html_text)
print("Cleaned Text:", clean_text)


Cleaned Text: Hello, world!


### **Normalization:**

Normalization in text preprocessing involves standardizing text data to ensure consistency and facilitate analysis. Common normalization techniques include:
1. Converting text to lowercase or uppercase
2. Handling Contractions
3. Removing Diacritics

In [4]:
import contractions
import re

def normalize_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Expand contractions
    text = contractions.fix(text)
    
    # Remove punctuation and other non-alphanumeric characters (accented letters are kept)
    text = ''.join(char for char in text if char.isalnum() or char.isspace())
    
    return text

# Example text with variations
text_with_variations = "Don't you think Mr. Smith's car isn't beautiful? It's Dr. Brown's car."

# Normalize text
normalized_text = normalize_text(text_with_variations)
print("Normalized Text:", normalized_text)

Normalized Text: do not you think mr smiths car is not beautiful it is dr browns car


###### Explanation:
Lowercasing: The text is converted to lowercase using the lower() method.

Expanding Contractions: The contractions.fix() function from the contractions library is used to expand contractions in the text. Contractions are shortened versions of words or phrases, such as "don't" for "do not" or "isn't" for "is not".

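Removing Punctuation: The isalnum() and isspace() checks in the generator expression keep only alphanumeric and whitespace characters, which strips punctuation such as apostrophes and periods. Note that this step does not remove diacritics: accented letters such as "é" count as alphanumeric and pass through unchanged. Stripping diacritics requires Unicode normalization, for example with Python's unicodedata module.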
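Diacritics, the third technique in the list above, are usually removed through Unicode normalization. Here is a minimal sketch using the standard-library `unicodedata` module; the helper name `remove_diacritics` is illustrative.

```python
import unicodedata

def remove_diacritics(text):
    # NFKD normalization splits an accented letter into its base letter
    # followed by one or more combining marks (e.g. "é" -> "e" + U+0301)
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks and keep everything else
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_diacritics("Café déjà vu, naïve résumé"))
# prints: Cafe deja vu, naive resume
```

Letters that are not composed of a base letter plus a mark (for example "ø" or "ß") are left untouched by this approach.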

### **Tokenization:**

Tokenization is the process of breaking down raw text into smaller, meaningful units called tokens. These tokens can be words, phrases, or symbols, depending on the specific task or application.

##### Types of Tokenization:

1. **Word Tokenization**:
    In word tokenization, the text is split into individual words based on whitespace or punctuation.

2. **Sentence Tokenization**:
    Sentence tokenization involves splitting the text into individual sentences based on punctuation marks such as periods, exclamation marks, or question marks.

3. **Whitespace Tokenization**:
    Whitespace tokenization splits the text into tokens based solely on whitespace characters (spaces, tabs, newlines).

4. **Regular Expression Tokenization**:
    Regular expression tokenization uses predefined patterns to split the text into tokens. This allows for more flexibility in tokenizing text based on specific criteria.

The code below demonstrates word, sentence, and whitespace tokenization using the NLTK library in Python, along with regular expression tokenization using the built-in re module.

From NLTK we import word_tokenize and sent_tokenize, which are functions for word and sentence tokenization, and WhitespaceTokenizer, which is a class that is instantiated before use.

In [8]:
import re
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text 
text = "Tokenization is the process of breaking down raw text into smaller, meaningful units called tokens. It is crucial in natural language processing."

# Word tokenization
words = word_tokenize(text)

# Sentence tokenization
sentences = sent_tokenize(text)

print("Word Tokenization:", words)
print("\nSentence Tokenization:", sentences)

Word Tokenization: ['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'raw', 'text', 'into', 'smaller', ',', 'meaningful', 'units', 'called', 'tokens', '.', 'It', 'is', 'crucial', 'in', 'natural', 'language', 'processing', '.']

Sentence Tokenization: ['Tokenization is the process of breaking down raw text into smaller, meaningful units called tokens.', 'It is crucial in natural language processing.']


In [7]:
import re
from nltk.tokenize import WhitespaceTokenizer


#Sample text 
text = "The quick-brown, fox;jumps.over!the?lazy/dog."

# Whitespace tokenization
whitespace_tokenizer = WhitespaceTokenizer()
whitespace_tokens = whitespace_tokenizer.tokenize(text)

# Regular expression tokenization excluding punctuation
regex_tokens = re.findall(r'\b\w+\b', text)

print("Whitespace Tokenization:", whitespace_tokens)
print("Regular Expression Tokenization excluding Punctuation:", regex_tokens)


Whitespace Tokenization: ['The', 'quick-brown,', 'fox;jumps.over!the?lazy/dog.']
Regular Expression Tokenization excluding Punctuation: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']


While whitespace tokenization is straightforward, regular expression tokenization provides more flexibility and control over the tokenization process. Depending on the complexity of the text and the specific requirements of your NLP task, you may choose the appropriate tokenization method accordingly.
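To illustrate that flexibility, here is a hedged sketch of a custom pattern that keeps hyphenated words and contractions together and emits punctuation marks as separate tokens. The pattern and the name `regex_tokenize` are assumptions made for illustration; they do not reproduce NLTK's behavior.

```python
import re

# A word may contain internal hyphens or apostrophes; any other
# non-space, non-word character becomes its own token
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("The quick-brown fox doesn't stop, does it?"))
# prints: ['The', 'quick-brown', 'fox', "doesn't", 'stop', ',', 'does', 'it', '?']
```

Changing a single pattern is enough to switch policies, for instance dropping the second alternative to discard punctuation entirely.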

### **Stopword Removal:**

Stopwords are words that occur very frequently in a language but typically carry little meaning on their own. Examples in English include "the", "is", "and", and "to". Stopword removal eliminates these words from the text so that later steps can focus on the more informative ones.

For handling stopwords, the NLTK toolkit provides a convenient way to access a predefined list of stopwords, which typically consists of the most common words in a language. If you're using NLTK for the first time, you'll need to download the stopwords data using the following code:

In [9]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sundh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Once the download is complete, you can load the stopwords package from the NLTK corpus and utilize it in your text preprocessing pipeline.

Here's how you can print the list of stopwords in the English language:

In [10]:
from nltk.corpus import stopwords

# Load the English stopwords
english_stopwords = stopwords.words('english')

print("List of Stopwords in English Language:")
print(english_stopwords)


List of Stopwords in English Language:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',

The code below removes English stopwords from a sample text using NLTK. It tokenizes the text, filters out stopwords, and reconstructs the text without stopwords.

In [11]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
tokens = word_tokenize(text)

# Get the English stopwords list
stop_words = set(stopwords.words('english'))

# Filter out stopwords from the tokens
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Join the filtered tokens back into a sentence
filtered_text = ' '.join(filtered_tokens)

print("Original Text:", text)
print("Text after Stopword Removal:", filtered_text)


Original Text: The quick brown fox jumps over the lazy dog.
Text after Stopword Removal: quick brown fox jumps lazy dog .
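The output above still contains the final period as a separate token. If punctuation tokens are unwanted too, they can be filtered in the same pass. The sketch below assumes a tiny hand-written stopword set and a simple regex tokenizer so that it runs without NLTK data; in practice `stopwords.words('english')` would take the place of `STOP_WORDS`. The name `remove_stopwords` is illustrative.

```python
import re

# Tiny illustrative stopword set (an assumption; far smaller than NLTK's list)
STOP_WORDS = {"the", "over", "is", "and", "to", "a"}

def remove_stopwords(text):
    # Split into word tokens and single punctuation tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Keep a token only if it is alphanumeric and not a stopword
    kept = [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]
    return ' '.join(kept)

print(remove_stopwords("The quick brown fox jumps over the lazy dog."))
# prints: quick brown fox jumps lazy dog
```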


### **Stemming and Lemmatization in NLP:**

Stemming and lemmatization are techniques used in natural language processing (NLP) to normalize words by reducing them to their base or root forms. While both methods aim to achieve similar results, they differ in their approaches and the level of normalization they provide.

### Stemming:

Stemming is the process of stripping affixes, in practice mostly suffixes, from words to obtain their root or base form, known as the stem. Stemming algorithms apply heuristic rules rather than consulting a dictionary, so the result is not always a valid word: the Porter stemmer, for example, reduces "beautiful" to "beauti". This makes stemming more aggressive than lemmatization.

### Lemmatization:

Lemmatization, on the other hand, involves analyzing the morphological structure of words to determine their lemma, which is the canonical or dictionary form of the word. Unlike stemming, lemmatization ensures that the resulting words are valid and exist in the language's dictionary. Lemmatization considers factors such as part of speech and context to accurately determine the lemma of a word.

###### **Available Algorithms:**

- Porter Stemmer: One of the most widely used stemming algorithms, known for its simplicity and efficiency.
- Snowball Stemmer: An extension of the Porter Stemmer, offering support for multiple languages.
- Lancaster Stemmer: Known for its aggressive stemming approach, often producing shorter stems.
- WordNet Lemmatizer: Based on WordNet lexical database, providing accurate lemmatization based on part of speech.
- spaCy Lemmatizer: Part of the spaCy library, offering robust lemmatization capabilities with support for multiple languages.

In [13]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Sample text
text = "The beautiful flowers were blooming in the garden."

# Tokenize the text
tokens = word_tokenize(text)

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming
stemmed_words = [stemmer.stem(word) for word in tokens]

# Lemmatization
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]

print("Original Text:", text)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)



Original Text: The beautiful flowers were blooming in the garden.
Stemmed Words: ['the', 'beauti', 'flower', 'were', 'bloom', 'in', 'the', 'garden', '.']
Lemmatized Words: ['The', 'beautiful', 'flower', 'were', 'blooming', 'in', 'the', 'garden', '.']


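In this example, we use NLTK to perform stemming and lemmatization on the sample text. The PorterStemmer is used for stemming, while the WordNetLemmatizer is used for lemmatization. Each token is processed individually. Because lemmatize() is called without a part-of-speech argument, it treats every token as a noun, which is why "were" and "blooming" are left unchanged in the lemmatized output; passing pos='v' for verbs would be expected to reduce them to "be" and "bloom".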
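To make the contrast between the two approaches concrete without depending on NLTK data, the sketch below pairs a toy suffix-stripping stemmer with a tiny lookup-table lemmatizer. Both the suffix rules and the `LEMMAS` table are simplified assumptions for illustration; they are not the Porter algorithm or WordNet.

```python
# Suffixes tried in order; the first match is stripped (toy rules only)
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Hand-written lemma table keyed by (word, part of speech)
LEMMAS = {
    ("were", "v"): "be",
    ("blooming", "v"): "bloom",
    ("flowers", "n"): "flower",
    ("studies", "n"): "study",
}

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three remaining characters so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    # Fall back to the lowercased word when the table has no entry
    return LEMMAS.get((word.lower(), pos), word.lower())

for word, pos in [("studies", "n"), ("blooming", "v"), ("were", "v")]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
# prints:
# studies -> stud | study
# blooming -> bloom | bloom
# were -> were | be
```

The stemmer yields the non-word "stud" and cannot relate "were" to "be", while the lemmatizer handles both but only because it has a dictionary entry and a part-of-speech tag to work with.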

### **Spell Checking and Correction in NLP:**

Spell checking and correction is a crucial task in natural language processing (NLP) that aims to identify and rectify misspelled words in text data. Accurate spelling is essential for effective communication and understanding in various NLP applications, such as text processing, search engines, and document analysis.

###### Types of Spell Checking and Correction:

1. Dictionary-Based Approach
2. Probabilistic Approach
3. Rule-Based Approach

###### Algorithms and Techniques:

- Edit Distance
- Phonetic Algorithms

In [18]:
from textblob import TextBlob

text = "The speeling of thise sentince is incorrrect."

# Create a TextBlob object
blob = TextBlob(text)

# Correct spelling errors
corrected_text = blob.correct()

print("Original Text:", text)
print("Corrected Text:", corrected_text)

Original Text: The speeling of thise sentince is incorrrect.
Corrected Text: The spelling of this sentence is incorrect.


This code creates a TextBlob object from the sample text. Its correct() method returns a new TextBlob with likely spelling errors corrected, and the result is stored in the corrected_text variable.
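Edit distance, listed among the techniques above, can be combined with a dictionary to build a very small corrector. The sketch below implements Levenshtein distance and picks the closest word from a toy vocabulary. The vocabulary, the `max_distance` threshold, and the function names are assumptions for illustration; this is not how TextBlob works internally.

```python
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance, computed one row at a time
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

# Tiny illustrative vocabulary (an assumption, not a real dictionary)
VOCABULARY = ["spelling", "this", "sentence", "incorrect", "the", "of", "is"]

def correct_word(word, max_distance=2):
    # Pick the closest vocabulary word; keep the input if nothing is close enough
    best = min(VOCABULARY, key=lambda candidate: edit_distance(word, candidate))
    return best if edit_distance(word, best) <= max_distance else word

print(edit_distance("speeling", "spelling"))  # prints 1
print([correct_word(w) for w in ["speeling", "thise", "sentince", "incorrrect"]])
# prints: ['spelling', 'this', 'sentence', 'incorrect']
```

A purely distance-based choice ignores how common each candidate is, which is where the probabilistic approaches listed above come in.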

### **Entity Recognition and Replacement in NLP:**

Entity recognition and replacement is a fundamental task in natural language processing (NLP) that involves identifying and replacing named entities in text data with standardized representations. Named entities are specific objects, people, locations, organizations, dates, and other types of entities mentioned in text.

In this example, we use spaCy to perform named entity recognition and replacement on the sample text. The en_core_web_sm model from spaCy is used to process the text and identify named entities. 

In [20]:
import spacy

# Load the English NER model from spaCy
nlp = spacy.load("en_core_web_sm")

# Sample text with named entities
text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino, California, in 1976."

# Process the text using spaCy NER model
doc = nlp(text)

# Replace named entities with generic labels
replaced_text = ' '.join([token.ent_type_ if token.ent_type_ else token.text for token in doc])

print("Original Text:", text)
print("\nReplaced Text:", replaced_text)

Original Text: Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino, California, in 1976.

Replaced Text: ORG ORG was founded by PERSON PERSON and PERSON PERSON in GPE , GPE , in DATE .


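Here the tokens of named entities such as "Apple Inc.", "Steve Jobs", "Steve Wozniak", "Cupertino", "California", and "1976" are replaced with generic labels such as "ORG" (organization), "PERSON" (person), "GPE" (geopolitical entity), and "DATE" (date). Because the replacement is applied token by token, a multi-word entity produces one label per token, which is why "Apple Inc." appears as "ORG ORG" in the output.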

###### Below is the list of some generic labels:
['PERSON', 'ORG', 'LOC', 'GPE', 'DATE', 'TIME', 'MONEY', 'PERCENT', 'CARDINAL', 'ORDINAL', 'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'FAC', 'PRODUCT', 'NORP']
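An alternative to per-token substitution is to replace each entity span exactly once, using character offsets. The sketch below assumes the spans have already been produced by an NER model (in spaCy they would come from `doc.ents` via each span's `start_char`, `end_char`, and `label_`). Here the entities are written by hand so the example runs without a model, and `replace_entities` is an illustrative name.

```python
def replace_entities(text, spans):
    # spans: non-overlapping (start_char, end_char, label) tuples
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(label)               # one label per entity span
        cursor = end
    pieces.append(text[cursor:])           # text after the last entity
    return ''.join(pieces)

text = "Apple Inc. was founded by Steve Jobs in 1976."

# Hand-written stand-in for NER output (an assumption for this sketch)
entities = [("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON"), ("1976", "DATE")]
spans = [(text.index(s), text.index(s) + len(s), label) for s, label in entities]

print(replace_entities(text, spans))
# prints: ORG was founded by PERSON in DATE.
```

Working with character offsets also preserves the original spacing and punctuation, which joining tokens with spaces does not.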

### **Handling Emojis in Text**

With the increasing prevalence of social media platforms, emojis have become an essential part of online communication. However, for certain text analysis tasks, the presence of emojis may be undesirable. In such cases, a practical solution is to remove emojis from the text.

The remove_emoji function below uses a regular expression built from Unicode code point ranges to identify and eliminate emojis from the input string. It covers many common emojis, including emoticons, pictographs, transport symbols, and flags, but it is not exhaustive: emojis outside the listed ranges pass through unchanged.

In [21]:
import re

def remove_emoji(string):
    # Define the emoji pattern
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002600-\U000026FF"  # miscellaneous symbols
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2"             # circled M
                               u"\U0001F200-\U0001F251"  # enclosed ideographic supplement
                               "]+", flags=re.UNICODE)
    # Remove emojis from the string
    return emoji_pattern.sub(r'', string)

# Test the function
print(remove_emoji("festivities are ongoing 🔥🔥")) 
print(remove_emoji("Amusing 😂"))   


festivities are ongoing 
Amusing 


Additionally, we may explore methods to convert emojis into their corresponding textual representations in future discussions. Stay tuned for further insights on this topic.
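As a small preview of that idea, the standard library already exposes Unicode character names, which can serve as rough textual stand-ins. This is a minimal sketch, assuming single-code-point emojis; `emoji_to_text` is an illustrative name, and multi-code-point sequences (such as flags or skin-tone variants) would need a dedicated package.

```python
import unicodedata

def emoji_to_text(text):
    out = []
    for ch in text:
        # The "So" (Symbol, other) category includes most single-code-point emojis
        if unicodedata.category(ch) == "So":
            # Fall back to the character itself if it has no Unicode name
            name = unicodedata.name(ch, ch)
            out.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            out.append(ch)
    return ''.join(out)

print(emoji_to_text("Amusing 😂"))
# prints: Amusing :face_with_tears_of_joy:
```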

### Closing Statement:
We have covered a wide array of text preprocessing techniques essential for Natural Language Processing (NLP) tasks. From the core steps like noise removal, normalization, tokenization, stopword removal, stemming, and lemmatization to additional techniques such as spell checking, entity recognition, handling out-of-vocabulary words, and emoji removal, we've explored the tools and methods necessary for refining and preparing text data for analysis and modeling.

These preprocessing steps are crucial for enhancing the accuracy and effectiveness of NLP algorithms, enabling them to extract meaningful insights from textual data. As we continue to delve into more fascinating topics in upcoming articles, spanning various domains of artificial intelligence such as machine learning algorithms, neural networks, computer vision, and more, we'll further unravel the intricacies of text preprocessing and its role in unlocking the full potential of AI applications. Stay tuned for more exciting insights and discoveries in the realm of artificial intelligence!