# Introduction to Natural Language Processing (NLP): Basics of Text Preprocessing

Three Domains of AI:

- Computer Vision [Image and Video -> Image Processing using CNN and RNN]
- Natural Language Processing [Text and Audio -> Chatbots, Alexa, Siri]
- Statistical Data [CSV, Excel -> Recommendation & Prediction using historical data]

- Agenda:
    - How NLP applications work [How chatbots work?]

- NLP: Natural Language Processing
    - Def: Natural Language Processing (NLP) is a domain of artificial intelligence (AI) focused on the interaction between computers and humans through natural language.
    - Goal: The primary goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is valuable.
    - Applications: Applications of NLP include language translation (Google Translate), chatbots (like ChatGPT), sentiment analysis (identifying emotions in tweets or reviews), and much more.

- Steps:
    - Humans communicate with a machine using text in any language (for example, English prompts to ChatGPT) --> the machine takes this text as input --> the machine preprocesses the text. Why? Preprocessing cleans, structures, and converts raw text into a form that machines can interpret easily.

- Key Steps in Text Preprocessing
    - 1. Tokenization
    - 2. Lowercasing
    - 3. Removing Stop Words
    - 4. Removing Punctuation
    - 5. Stemming and Lemmatization
    - 6. Handling Numbers
    - 7. Removing Special Characters
    - 8. Text Normalization
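
The steps above are usually chained so that the output of one step becomes the input of the next. The following is a minimal sketch of that idea using only the standard library. The helper names (`tokenize`, `lowercase`, `drop_non_alnum`, `run_pipeline`) are hypothetical, and the regex tokenizer is a simplified stand-in for a real tokenizer, not NLTK's implementation.

```python
import re
from functools import reduce

def tokenize(text):
    # Words (\w+) or single non-space symbols; a simplified stand-in tokenizer
    return re.findall(r"\w+|[^\w\s]", text)

def lowercase(tokens):
    return [token.lower() for token in tokens]

def drop_non_alnum(tokens):
    # Keep only tokens made entirely of letters and digits
    return [token for token in tokens if token.isalnum()]

def run_pipeline(text, steps):
    # Feed the result of each step into the next one, in order
    return reduce(lambda value, step: step(value), steps, text)

tokens = run_pipeline("The laptop is FAST and efficient!", [tokenize, lowercase, drop_non_alnum])
print(tokens)  # ['the', 'laptop', 'is', 'fast', 'and', 'efficient']
```

Keeping each step as its own function makes it easy to reorder or skip steps for a given task.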

### Step 1. Tokenization:
    - Def.: 
        - Tokenization is the process of breaking a text into individual components or "tokens" like words, phrases, or sentences. 
        - It serves as the foundation for all further text analysis.
        
    - Example:
        - Text: "Natural language processing is fascinating."

        - Tokenized: ["Natural", "language", "processing", "is", "fascinating", "."]

    - Scenario:
        - When analyzing product reviews, tokenizing the sentences lets you examine each word individually.
        - For instance, in a customer review like "The laptop is fast and efficient," tokenization separates the words "laptop," "fast," and "efficient," which can help identify the core sentiment.
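
To make the idea concrete, here is a rough sketch of word-level and sentence-level tokenization with regular expressions. The function names are hypothetical and the rules are deliberately naive; this is not how NLTK tokenizes.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentence_split(text):
    # Split on whitespace that follows ., ! or ? (naive: also splits after "Dr.")
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]

print(simple_word_tokenize("Natural language processing is fascinating."))
# ['Natural', 'language', 'processing', 'is', 'fascinating', '.']

print(simple_sentence_split("The laptop is fast. It is efficient too!"))
# ['The laptop is fast.', 'It is efficient too!']
```

A regex like this splits "I've" into three tokens (I, ', ve), whereas a trained tokenizer typically keeps the contraction fragment together, which is one reason to prefer a library tokenizer for real data.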
    

In [2]:
### Importing the required libraries
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [3]:
# Download NLTK data files (stop-word lists and the WordNet lexicon)
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rista\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Rista\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
# Sample Reviews in English
reviews = [
    "The movie was fantastic! The storyline kept me on the edge of my seat the entire time 🔥.",
    "This restaurant has the best pasta I've ever had. The service was quick, but the ambiance could be better.",
    "I'm disappointed with the phone's battery life. It drains way too fast, but the camera quality is gr8. I will give 10 out of 10 !",
]

In [5]:
# Step 1: Tokenization
# 1.1 Define a function that tokenizes one review into individual words (tokens)
def tokenize_review(rev):
    tokens = nltk.word_tokenize(rev)
    return tokens

# tokenize_review(reviews)  # TypeError: expected string or bytes-like object, got 'list' (pass one review at a time)


In [6]:
# Tokenizing each review
tokenized_reviews = [tokenize_review(rev) for rev in reviews]
print("Tokenized Reviews:")
for i, tokens in enumerate(tokenized_reviews):
    print(f"Review {i+1}:", tokens)
    #print(tokens)

Tokenized Reviews:
Review 1: ['The', 'movie', 'was', 'fantastic', '!', 'The', 'storyline', 'kept', 'me', 'on', 'the', 'edge', 'of', 'my', 'seat', 'the', 'entire', 'time', '🔥', '.']
Review 2: ['This', 'restaurant', 'has', 'the', 'best', 'pasta', 'I', "'ve", 'ever', 'had', '.', 'The', 'service', 'was', 'quick', ',', 'but', 'the', 'ambiance', 'could', 'be', 'better', '.']
Review 3: ['I', "'m", 'disappointed', 'with', 'the', 'phone', "'s", 'battery', 'life', '.', 'It', 'drains', 'way', 'too', 'fast', ',', 'but', 'the', 'camera', 'quality', 'is', 'gr8', '.', 'I', 'will', 'give', '10', 'out', 'of', '10', '!']


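If tokenization fails with a LookupError saying the punkt resource was not found, import nltk and download punkt using the command below, then rerun Step 1. (Newer NLTK releases may ask for punkt_tab instead; download whichever resource the error message names.)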

In [1]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rista\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Step 2: Lowercasing
- Convert all the words to lowercase to maintain consistency. 
- Otherwise, words like "Phone" and "phone" might be treated as different.

    - Example:
    - Lowercased: ['the', 'movie', 'was', 'fantastic', '!', 'the', 'storyline', 'kept', 'me', 'on', 'the', 'edge', 'of', 'my', 'seat', 'the', 'entire', 'time', '🔥', '.']

In [7]:
# Step 2: Lowercasing
# Converting all tokens to lowercase for uniformity
def lowercase_tokens(tokens):
    return [token.lower() for token in tokens]

In [8]:
lowercased_reviews = [lowercase_tokens(tokens) for tokens in tokenized_reviews]
print("\nLowercased Reviews:")
for i, tokens in enumerate(lowercased_reviews):
    # print(f"Review {i+1}:", tokens)
    print(tokens)


Lowercased Reviews:
['the', 'movie', 'was', 'fantastic', '!', 'the', 'storyline', 'kept', 'me', 'on', 'the', 'edge', 'of', 'my', 'seat', 'the', 'entire', 'time', '🔥', '.']
['this', 'restaurant', 'has', 'the', 'best', 'pasta', 'i', "'ve", 'ever', 'had', '.', 'the', 'service', 'was', 'quick', ',', 'but', 'the', 'ambiance', 'could', 'be', 'better', '.']
['i', "'m", 'disappointed', 'with', 'the', 'phone', "'s", 'battery', 'life', '.', 'it', 'drains', 'way', 'too', 'fast', ',', 'but', 'the', 'camera', 'quality', 'is', 'gr8', '.', 'i', 'will', 'give', '10', 'out', 'of', '10', '!']
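
Why this matters can be seen with a quick word count. The sketch below uses a made-up token list (not taken from the reviews) to show how case variants split a count until they are lowercased.

```python
from collections import Counter

tokens = ["Phone", "phone", "PHONE", "battery"]

counts_raw = Counter(tokens)
counts_lower = Counter(token.lower() for token in tokens)

print(counts_raw["phone"])    # 1
print(counts_lower["phone"])  # 3

# casefold() is a more aggressive variant intended for caseless matching
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```

For English-only text `lower()` is usually enough; `casefold()` may be worth considering when the data contains other scripts.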


### Step 3: Removing Stop Words
- Common words such as “the” and “is” in English, or “aur” (and), “hai” (is), and “bhi” (also) in Hindi, add little value to the analysis.
- Remove these stop words to focus on the more meaningful terms in the reviews. The reviews here are in English, so only the English list is used.

- Example:
    - Without stop words: ['movie', 'fantastic', '!', 'storyline', 'kept', 'edge', 'seat', 'entire', 'time', '🔥', '.']
- Now the review is boiled down to its most relevant words.

In [9]:
# Step 3: Removing Stop Words
# Stop words are common words like 'is', 'and', 'the' that don't add significant meaning to the text
stop_words = set(stopwords.words('english'))  # English stop words only; a set gives fast membership checks

def remove_stop_words(tokens):
    return [token for token in tokens if token not in stop_words]

In [10]:
filtered_reviews = [remove_stop_words(tokens) for tokens in lowercased_reviews]
print("\nReviews after Removing Stop Words:")
for i, tokens in enumerate(filtered_reviews):
    print(f"Review {i+1}:", tokens)


Reviews after Removing Stop Words:
Review 1: ['movie', 'fantastic', '!', 'storyline', 'kept', 'edge', 'seat', 'entire', 'time', '🔥', '.']
Review 2: ['restaurant', 'best', 'pasta', "'ve", 'ever', '.', 'service', 'quick', ',', 'ambiance', 'could', 'better', '.']
Review 3: ["'m", 'disappointed', 'phone', "'s", 'battery', 'life', '.', 'drains', 'way', 'fast', ',', 'camera', 'quality', 'gr8', '.', 'give', '10', '10', '!']
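
One caveat for sentiment analysis: stop-word lists commonly include negations such as "not" (NLTK's English list does), and removing them can flip the meaning of a review. The sketch below illustrates one possible workaround. `DEMO_STOP_WORDS`, `NEGATIONS`, and the function name are hypothetical, and the tiny hand-written set stands in for a real stop-word list.

```python
# Small hand-written stop-word set for illustration only
DEMO_STOP_WORDS = {"the", "is", "was", "a", "an", "and", "not", "but"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words_keep_negation(tokens, stop_words=DEMO_STOP_WORDS):
    # Take negations out of the stop-word set before filtering
    effective = stop_words - NEGATIONS
    return [token for token in tokens if token not in effective]

tokens = ["the", "camera", "is", "not", "good"]

plain = [token for token in tokens if token not in DEMO_STOP_WORDS]
print(plain)                                    # ['camera', 'good']
print(remove_stop_words_keep_negation(tokens))  # ['camera', 'not', 'good']
```

Whether to keep negations depends on the task; for topic classification they may matter less than for sentiment.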


### Step 4: Removing Punctuation and Emojis
Next, we remove punctuation marks and emojis. The fire emoji (🔥) may express enthusiasm, but a model cannot use it unless the pipeline handles emojis explicitly.

Example:
Without punctuation and emojis: ['movie', 'fantastic', 'storyline', 'kept', 'edge', 'seat', 'entire', 'time']
This step further cleans the data and prepares it for analysis.

The filter used below is a list comprehension of the form [output for variable in list if condition]. Because isalnum() is applied to whole tokens, contraction fragments such as "'ve", "'m", and "'s" are dropped along with punctuation and emojis.



In [11]:
# Step 4: Removing Punctuation and Emojis
# Removing punctuation and emojis from tokens
def remove_punctuation_and_emojis(tokens):
    clean_tokens = [token for token in tokens if token.isalnum()]  # Retain only alphanumeric tokens
    return clean_tokens

In [12]:
# Explanation of: clean_tokens = [token for token in tokens if token.isalnum()]
# (here 'tokens' is whatever list the name currently refers to)

clean_tokens = [
    token         # Include the token
    for token in tokens  # Loop through each token in the 'tokens' list
    if token.isalnum()   # Check if the token is alphanumeric (removes punctuation and special characters)
]
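
For readers new to list comprehensions, the same filter can be written as an explicit loop. This is only an equivalent rewrite for illustration; `sample_tokens` is a made-up list.

```python
sample_tokens = ["movie", "fantastic", "!", "🔥", "'ve", "gr8", "10"]

# Explicit loop version
clean_loop = []
for token in sample_tokens:
    if token.isalnum():  # True only if every character is a letter or digit
        clean_loop.append(token)

# Comprehension version
clean_comprehension = [token for token in sample_tokens if token.isalnum()]

print(clean_loop)  # ['movie', 'fantastic', 'gr8', '10']
print(clean_loop == clean_comprehension)  # True
```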

In [13]:
cleaned_reviews = [remove_punctuation_and_emojis(tokens) for tokens in filtered_reviews]
print("\nReviews after Removing Punctuation and Emojis:")
for i, tokens in enumerate(cleaned_reviews):
    print(f"Review {i+1}:", tokens)


Reviews after Removing Punctuation and Emojis:
Review 1: ['movie', 'fantastic', 'storyline', 'kept', 'edge', 'seat', 'entire', 'time']
Review 2: ['restaurant', 'best', 'pasta', 'ever', 'service', 'quick', 'ambiance', 'could', 'better']
Review 3: ['disappointed', 'phone', 'battery', 'life', 'drains', 'way', 'fast', 'camera', 'quality', 'gr8', 'give', '10', '10']
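
If the emoji signal is worth keeping, one option is to convert emojis to words before filtering instead of discarding them. The sketch below relies on Unicode character names from the standard `unicodedata` module. The function name `emoji_to_word` is hypothetical, and the approach only covers single-character emojis (sequences with modifiers or joiners are left untouched and then filtered out).

```python
import unicodedata

def emoji_to_word(token):
    # Single characters in Unicode category "So" (Symbol, other) cover most emojis
    if len(token) == 1 and unicodedata.category(token) == "So":
        # unicodedata.name("🔥") is "FIRE"; fall back to the token itself if it has no name
        return unicodedata.name(token, token).lower().replace(" ", "")
    return token

tokens = ["movie", "fantastic", "!", "time", "🔥", "."]
converted = [emoji_to_word(token) for token in tokens]
cleaned = [token for token in converted if token.isalnum()]

print(cleaned)  # ['movie', 'fantastic', 'time', 'fire']
```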


### Step 5: Stemming and Lemmatization
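- Goal: Reduce each word to its root so that variants such as "drains" and "drain" are treated as the same term.
- Stemming strips word endings using rules, so the result may not be a real word (for example, "movie" becomes "movi").
- Lemmatization maps a word to its dictionary base form (for example, "drains" becomes "drain").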

In [14]:
# Step 5: Stemming and Lemmatization
# Stemming chops word endings with rules; the stem may not be a real word (e.g., 'movie' -> 'movi')
# Lemmatization maps words to their dictionary form (e.g., 'drains' -> 'drain')

# Using NLTK's PorterStemmer for stemming
stemmer = PorterStemmer()
def apply_stemming(tokens):
    return [stemmer.stem(token) for token in tokens]

In [15]:
stemmed_reviews = [apply_stemming(tokens) for tokens in cleaned_reviews]
print("\nStemmed Reviews:")
for i, tokens in enumerate(stemmed_reviews):
    print(f"Review {i+1}:", tokens)


Stemmed Reviews:
Review 1: ['movi', 'fantast', 'storylin', 'kept', 'edg', 'seat', 'entir', 'time']
Review 2: ['restaur', 'best', 'pasta', 'ever', 'servic', 'quick', 'ambianc', 'could', 'better']
Review 3: ['disappoint', 'phone', 'batteri', 'life', 'drain', 'way', 'fast', 'camera', 'qualiti', 'gr8', 'give', '10', '10']


In [16]:
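# Using NLTK's WordNetLemmatizer for lemmatization
# (tokens are treated as nouns unless a pos argument is given, so 'kept' and 'disappointed' stay unchanged)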
lemmatizer = WordNetLemmatizer()
def apply_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

In [17]:
lemmatized_reviews = [apply_lemmatization(tokens) for tokens in cleaned_reviews]
print("\nLemmatized Reviews:")
for i, tokens in enumerate(lemmatized_reviews):
    print(f"Review {i+1}:", tokens)



Lemmatized Reviews:
Review 1: ['movie', 'fantastic', 'storyline', 'kept', 'edge', 'seat', 'entire', 'time']
Review 2: ['restaurant', 'best', 'pasta', 'ever', 'service', 'quick', 'ambiance', 'could', 'better']
Review 3: ['disappointed', 'phone', 'battery', 'life', 'drain', 'way', 'fast', 'camera', 'quality', 'gr8', 'give', '10', '10']
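
The difference between the two outputs comes down to rules versus lookup. The toy sketch below contrasts a naive suffix stripper with a table lookup. It is not the Porter algorithm or WordNet; `SUFFIXES`, `LEMMA_TABLE`, and both function names are hypothetical and cover only a handful of words.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order

def naive_stem(word):
    # Rule-based: cut the first matching suffix, leaving at least three characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Lookup-based: a tiny hand-written dictionary of base forms
LEMMA_TABLE = {"drains": "drain", "kept": "keep", "better": "good"}

def naive_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for word in ["drains", "running", "kept", "better"]:
    print(word, "->", naive_stem(word), "|", naive_lemmatize(word))
# drains -> drain | drain
# running -> runn | running
# kept -> kept | keep
# better -> better | good
```

The rule-based version is fast but produces non-words such as "runn"; the lookup version returns real words but only for entries it knows about.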


### Step 6. Handling Numbers
Numbers often carry little meaning unless they represent quantities, prices, or rankings. You can remove them, replace them with a placeholder, or leave them as is.

Example:
Text: "He bought 3 laptops for $1500."

Without numbers: "He bought laptops for ."
With placeholders: "He bought <NUM> laptops for <NUM>."

In an e-commerce setting, placeholders abstract away specific values so the model can focus on context. Note that the code below removes purely numeric tokens, which also discards the "10 out of 10" rating in Review 3.

In [18]:
def nonum_rev(tokens):
    # Drop tokens made up only of digits; mixed tokens such as 'gr8' are kept
    nonum_token = [token for token in tokens if not token.isdigit()]
    return nonum_token

In [19]:
nonum_reviews = [nonum_rev(tokens) for tokens in lemmatized_reviews]

print("\nReview with handled Numbers: ")
for i, tokens in enumerate(nonum_reviews):
    print(f"Review {i+1}:", tokens)


Review with handled Numbers: 
Review 1: ['movie', 'fantastic', 'storyline', 'kept', 'edge', 'seat', 'entire', 'time']
Review 2: ['restaurant', 'best', 'pasta', 'ever', 'service', 'quick', 'ambiance', 'could', 'better']
Review 3: ['disappointed', 'phone', 'battery', 'life', 'drain', 'way', 'fast', 'camera', 'quality', 'gr8', 'give']
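
The placeholder option from the example can be sketched with a regular expression. The pattern, the `<NUM>` placeholder handling, and the function names below are assumptions chosen to reproduce the example sentence; real data may need a broader pattern.

```python
import re

# Optional currency sign, then digits with optional . or , groups, on word boundaries
NUMBER_PATTERN = re.compile(r"[$€£]?\b\d+(?:[.,]\d+)*\b")

def replace_numbers(text, placeholder="<NUM>"):
    return NUMBER_PATTERN.sub(placeholder, text)

def remove_numbers(text):
    without = NUMBER_PATTERN.sub("", text)
    return " ".join(without.split())  # collapse the leftover double spaces

text = "He bought 3 laptops for $1500."
print(replace_numbers(text))  # He bought <NUM> laptops for <NUM>.
print(remove_numbers(text))   # He bought laptops for .

# The word boundary keeps mixed tokens such as 'gr8' intact
print(replace_numbers("gr8 camera, 10 out of 10"))  # gr8 camera, <NUM> out of <NUM>
```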


### Step 7. Removing Special Characters
- Special characters like @, #, $, % don’t add value to most NLP tasks unless they represent specific information, like hashtags in social media.

- Example:
    - Text: "The cost is $500 #expensive!"

    - Without special characters: "The cost is 500 expensive"

Removing special characters is particularly important when dealing with formal text documents, but it might not always be desirable in social media data, where hashtags or mentions could carry important information.

In [20]:
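def remove_special_char(tokens):
    # Strip every character that is not a letter, digit, or whitespace
    stripped = [re.sub(r'[^A-Za-z0-9\s]', '', token) for token in tokens]
    # Drop tokens that became empty after stripping
    return [token for token in stripped if token]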

In [21]:
no_specialchar_reviews = [remove_special_char(tokens) for tokens in nonum_reviews]

print("\nFinal Review:")
for i, tokens in enumerate(no_specialchar_reviews):
    print(f"Review {i+1}:", tokens)


Final Review:
Review 1: ['movie', 'fantastic', 'storyline', 'kept', 'edge', 'seat', 'entire', 'time']
Review 2: ['restaurant', 'best', 'pasta', 'ever', 'service', 'quick', 'ambiance', 'could', 'better']
Review 3: ['disappointed', 'phone', 'battery', 'life', 'drain', 'way', 'fast', 'camera', 'quality', 'gr8', 'give']
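
For social media data, the same regular expression can be relaxed so that hashtags and mentions survive. The sketch below works on raw text rather than tokens; `strip_special` and its `keep` parameter are hypothetical names introduced for illustration.

```python
import re

def strip_special(text, keep=""):
    # Remove everything except letters, digits, whitespace, and the characters in `keep`
    pattern = rf"[^A-Za-z0-9\s{re.escape(keep)}]"
    return re.sub(pattern, "", text)

text = "The cost is $500 #expensive!"
print(strip_special(text))             # The cost is 500 expensive
print(strip_special(text, keep="#@"))  # The cost is 500 #expensive
```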


### Step 8. Text Normalization
- Inconsistent spelling or informal language use can cause confusion in NLP models. Normalization helps by converting variations of a word to a common form.

- Example:
    - Text: "u r gr8!"

    - Normalized: "you are great!"

- This is particularly useful when working with user-generated content, such as social media posts or customer feedback, where informal language is common.
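
One simple way to approach this kind of normalization is a lookup table of informal spellings. The sketch below reproduces the example above. `SLANG_MAP` is a tiny hand-written dictionary and `normalize_slang` is a hypothetical helper; a real system would need a much larger mapping or a learned model.

```python
import re

# Hand-written mapping for illustration only
SLANG_MAP = {"u": "you", "r": "are", "gr8": "great"}

def normalize_slang(text, mapping=SLANG_MAP):
    def replace(match):
        word = match.group(0)
        # Look up the lowercased word; leave unknown words unchanged
        return mapping.get(word.lower(), word)
    # \w+ visits each word and leaves punctuation and spacing as they are
    return re.sub(r"\w+", replace, text)

print(normalize_slang("u r gr8!"))  # you are great!
```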

In [22]:
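import unicodedata
def normal_tokens(tokens):
    # Unicode normalization only: NFKD separates accents from letters and the ASCII
    # round trip drops them (e.g., 'café' -> 'cafe'). Slang such as 'gr8' is unchanged.
    normalized_text = [unicodedata.normalize('NFKD', token).encode('ascii', 'ignore').decode('ascii') for token in tokens]
    return normalized_text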

In [23]:
normalized_reviews = [normal_tokens(tokens) for tokens in no_specialchar_reviews]

print("\nNormalized Reviews:")
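for i, tokens in enumerate(normalized_reviews):
    print(f"Review {i+1}:", tokens)


Normalized Reviews:
Review 1: ['movie', 'fantastic', 'storyline', 'kept', 'edge', 'seat', 'entire', 'time']
Review 2: ['restaurant', 'best', 'pasta', 'ever', 'service', 'quick', 'ambiance', 'could', 'better']
Review 3: ['disappointed', 'phone', 'battery', 'life', 'drain', 'way', 'fast', 'camera', 'quality', 'gr8', 'give']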