<div style="background-color: #f9f9fc; color: #333366; border-radius: 12px; margin: 20px auto; padding: 20px; border: 2px solid #ff4c4c; max-width: 1000px; font-family: Arial, sans-serif; line-height: 1.6;">
  <h2 style="text-align: center; color: #333366;">🔍 Text Preprocessing in NLP Pipeline</h2>

---

## **1. Introduction**
   - Overview of text preprocessing and its importance in NLP.

## **2. Preprocessing Techniques**
   - **Lowercasing**
   - **Remove HTML Tags**
   - **Remove URLs**
   - **Remove Punctuation**
   - **Chat Word Treatment** (Handling informal or abbreviated words)
   - **Spelling Correction**
   - **Removing Stop Words**
   - **Handling Emojis**
   - **Tokenization** (Splitting text into words/sentences)
   - **Stemming** (Reducing words to their root form)
   - **Lemmatization** (Finding the base form of words)

## **3. Assignment**
   - Practical implementation of text preprocessing steps.

<div style="background-color: #f9f9fc; color: #333366; border-radius: 12px; margin: 20px auto; padding: 20px; border: 2px solid #ff4c4c; max-width: 1000px; font-family: Arial, sans-serif; line-height: 1.6;">
  <h2 style="text-align: center; color: #333366;">Introduction to NLP Data Processing</h2>

---

## **1. Data Acquisition**
   - **Sources of Data:**
     - **Web Scraping**
     - **APIs**

## **2. Dataset Preparation**
   - Acquired data is processed for NLP tasks.

## **3. Text Preprocessing**
   - **Basic Preprocessing:**
     - Lowercasing, punctuation removal, stop-word removal, etc.
   - **Advanced Preprocessing:**
     - **POS Tagging** (Identifying parts of speech)
     - **Chunking** (Grouping words into meaningful phrases)
     - **Parsing** (Understanding sentence structure)
     - **Coreference Resolution** (Handling pronoun references)
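
### Lowercasing

Lowercasing is the first basic step listed above. A minimal sketch, assuming plain Python strings; the helper name `to_lowercase` is hypothetical and not part of the original notebook. `str.casefold()` is shown alongside `str.lower()` because it is a more aggressive form intended for caseless matching.

```python
def to_lowercase(text):
    """Return a lowercased copy of the text (hypothetical helper)."""
    return text.lower()

sample = "Probably My ALL-TIME Favorite Movie"
print(to_lowercase(sample))   # probably my all-time favorite movie

# casefold() goes further than lower() for some non-English characters
print("Straße".lower())       # straße
print("Straße".casefold())    # strasse
```

Lowercasing keeps tokens such as "Movie" and "movie" from being counted as different words, at the cost of losing case information that some tasks need.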



### Remove HTML Tags

In [2]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [3]:
text = "<html> <body> <p>Movie 1</p> <p>Actor - Aamir Khan</p> <p>Click here to <a href='http://google.com'>download</a></p> </body> </html>"

In [4]:
remove_html_tags(text)

'  Movie 1 Actor - Aamir Khan Click here to download  '
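
The regex above is fine for simple markup, but it does not decode entities such as `&amp;` and can be confused by a `>` inside an attribute value. A possible alternative, sketched here with the standard library's `html.parser` module, collects only the text nodes. The names `_TextExtractor` and `strip_html` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

print(strip_html("<p>Tom &amp; Jerry</p>"))  # Tom & Jerry
```

For heavily malformed pages, a dedicated HTML library may be a better fit than either approach.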

### Remove URLs

In [5]:
import re

def remove_url(text):
    """Removes URLs from the input text."""
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [6]:
# Sample texts
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

In [7]:
# Removing URLs
cleaned_text = remove_url(text2)
print(cleaned_text)

Check out my notebook 
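
Removing a URL leaves the surrounding spaces behind, as the trailing space in the output above shows. The sketch below applies the same pattern to three of the sample texts and then collapses the leftover whitespace. The helper name `remove_url_and_tidy` is hypothetical.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url_and_tidy(text):
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

samples = [
    'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb',
    'Google search here www.google.com',
    'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com',
]

for sample in samples:
    print(repr(remove_url_and_tidy(sample)))
# 'Check out my notebook'
# 'Google search here'
# 'For notebook click to search check'
```

Note that `\S+` also swallows punctuation that directly follows a URL, such as a closing parenthesis or a full stop.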


### Remove Punctuation

In [11]:
import string, time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
# Define punctuation characters
exclude = string.punctuation

In [14]:
def remove_punc(text):
    """Removes punctuation from the input text."""
    for char in exclude:
        text = text.replace(char, '')  # Replace punctuation with an empty string
    return text

In [15]:
# Example
text = "String. With, Punctuation?"


In [16]:
cleaned_text = remove_punc(text)
print(cleaned_text)

String With Punctuation


In [18]:
import time

start = time.time()  # Record the start time
print(remove_punc(text))  # Run the function being timed
time1 = time.time() - start  # Elapsed wall-clock time in seconds

print(f"Execution Time: {time1:.6f} seconds")

String With Punctuation
Execution Time: 0.000000 seconds
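
A single call is too fast for `time.time()` to measure, which is why the result above reads as zero. One way to get a usable number is to repeat the call many times with `timeit`. The sketch below also compares the loop against `str.translate`, which removes all punctuation in one pass and is often faster. The function names and the repetition count are assumptions made for illustration, and the timings depend on the machine.

```python
import string
import timeit

# Translation table that maps every punctuation character to None (deletion)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punc_loop(text):
    """Same approach as remove_punc above: one replace() per punctuation mark."""
    for char in string.punctuation:
        text = text.replace(char, '')
    return text

def remove_punc_translate(text):
    """Delete all punctuation in a single pass with str.translate."""
    return text.translate(PUNCT_TABLE)

sample = "String. With, Punctuation?"
print(remove_punc_translate(sample))  # String With Punctuation

# Repeat each call 10,000 times so the total is large enough to measure
loop_time = timeit.timeit(lambda: remove_punc_loop(sample), number=10_000)
translate_time = timeit.timeit(lambda: remove_punc_translate(sample), number=10_000)
print(f"loop:      {loop_time:.4f} seconds")
print(f"translate: {translate_time:.4f} seconds")
```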


In [8]:
import re

def remove_punctuation(text):
    """Removes punctuation from the input text."""
    pattern = re.compile(r'[!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~]')
    return pattern.sub(r'', text)

In [9]:
# Example usage
text = "Hello, World! This is a test: does it remove punctuation?"

In [10]:
cleaned_text = remove_punctuation(text)
print(cleaned_text)

Hello World This is a test does it remove punctuation


### Chat abbreviations

In [20]:
def chat_conversion(text):
    """Replaces common chat abbreviations with their full forms."""
    chat_words = {
        "IMHO": "In My Honest/Humble Opinion",
        "FYI": "For Your Information",
        "BRB": "Be Right Back",
        "LOL": "Laugh Out Loud",
        "OMG": "Oh My God",
        "TTYL": "Talk To You Later"
    }
    
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:  # Convert to uppercase to match keys
            new_text.append(chat_words[w.upper()])  # Replace with full form
        else:
            new_text.append(w)  # Keep original word if not found

    return " ".join(new_text)

In [21]:
print(chat_conversion('IMHO he is the best'))  # Output: 'In My Honest/Humble Opinion he is the best'

In My Honest/Humble Opinion he is the best


In [22]:
print(chat_conversion('FYI delhi is the capital of india'))  # Output: 'For Your Information delhi is the capital of india'

For Your Information delhi is the capital of india
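
Splitting on whitespace misses abbreviations with punctuation attached, such as `LOL!` or `brb,`. A possible variant, sketched below, matches the abbreviations with a word-boundary regex instead. The names `CHAT_WORDS` and `expand_chat_words` are hypothetical, and the dictionary is a shortened copy of the one above.

```python
import re

CHAT_WORDS = {
    "FYI": "For Your Information",
    "BRB": "Be Right Back",
    "LOL": "Laugh Out Loud",
}

# \b keeps the match on whole words, so "lollipop" is left alone
CHAT_PATTERN = re.compile(
    r'\b(' + '|'.join(map(re.escape, CHAT_WORDS)) + r')\b',
    flags=re.IGNORECASE,
)

def expand_chat_words(text):
    """Replace chat abbreviations, keeping any adjacent punctuation."""
    return CHAT_PATTERN.sub(lambda m: CHAT_WORDS[m.group(0).upper()], text)

print(expand_chat_words("brb, grabbing coffee. LOL!"))
# Be Right Back, grabbing coffee. Laugh Out Loud!
```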


### Spelling Correction

In [23]:
from textblob import TextBlob

# incorrect text
incorrect_text = "ceertain conditionas during seveal ggnerations aree moodified in the saame maner."

# Create a TextBlob object
textBlob = TextBlob(incorrect_text)

# Correct spelling errors
corrected_text = textBlob.correct()

# Print corrected text
print(corrected_text)

certain conditions during several generations are modified in the same manner.


### Remove Stop Words

In [24]:
from nltk.corpus import stopwords

In [25]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [34]:
import nltk
from nltk.corpus import stopwords

# Ensure stopwords are downloaded
nltk.download('stopwords')

def remove_stopwords(text):
    """Removes English stopwords from the given text."""
    stop_words = set(stopwords.words('english'))
    
    # Using list comprehension for efficiency
    filtered_text = " ".join([word for word in text.split() if word.lower() not in stop_words])
    
    return filtered_text

# Example Usage
text = "Probably my all-time favorite movie, a story of selflessness, sacrifice, and dedication to a noble cause."
cleaned_text = remove_stopwords(text)
print(cleaned_text)

Probably all-time favorite movie, story selflessness, sacrifice, dedication noble cause.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Bangla Stopword Removal Function

In [4]:
bangla_stopwords = set([
    "আমি", "আমার", "আমাদের", "তুমি", "তোমার", "সে", "এই", "তা", "উপর", "কিছু", 
    "কেন", "যা", "যেমন", "যদি", "যে", "যেখানে", "সব", "সেটা", "হয়", "হতে", 
    "হবে", "হয়েছে", "ছিল", "ছিলো", "দিয়ে", "না"
])

def remove_stopwords(text):
    """Removes Bangla stopwords from the given text."""
    
    filtered_text = " ".join([word for word in text.split() if word not in bangla_stopwords])
    return filtered_text

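# Example text (Bangla): "Natural language processing is a very important topic in artificial intelligence"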
text = "ন্যাচারাল ল্যাঙ্গুয়েজ প্রসেসিং একটা খুবই গুরুত্বপূর্ণ টপিকস আর্টিফিশিয়াল ইন্টেলিজেন্সে"
cleaned_text = remove_stopwords(text)
print(cleaned_text)

ন্যাচারাল ল্যাঙ্গুয়েজ প্রসেসিং একটা খুবই গুরুত্বপূর্ণ টপিকস আর্টিফিশিয়াল ইন্টেলিজেন্সে


In [3]:
text2 = "ন্যাচারাল ল্যাঙ্গুয়েজ প্রসেসিং একটা খুবই গুরুত্বপূর্ণ টপিকস আর্টিফিশিয়াল ইন্টেলিজেন্সে"
result = remove_stopwords(text2)

In [29]:
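# Remove stopwords inline, without the helper function.
# Assumed input: the text this cell originally ran on was not preserved. This
# sentence ("I want to meet with you") is a guess that matches the output below.
text = "আমি তোমার সাথে দেখা করতে চাই"
filtered_text = " ".join([word for word in text.split() if word not in bangla_stopwords])
print(filtered_text)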

সাথে দেখা করতে চাই


### Emoji Removal Function

In [5]:
import emoji

def remove_emoji(text):
    """Removes emojis from the input text."""
    return emoji.replace_emoji(text, replace="")  # Replaces all emojis with an empty string

# Example Usage
print(remove_emoji("Loved the movie. It was 😘😍"))   # Output: 'Loved the movie. It was '
print(remove_emoji("Lmao 😂😁"))                     # Output: 'Lmao '

Loved the movie. It was 
Lmao 


### Alternative solutions

In [6]:
import re

def remove_emoji(text):
    """Removes emojis using regex pattern."""
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Symbols & Pictographs
        "\U0001F680-\U0001F6FF"  # Transport & Map Symbols
        "\U0001F700-\U0001F77F"  # Alchemical Symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2"             # Circled Latin Capital Letter M
        "\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement
        "]+", flags=re.UNICODE
    )
    return emoji_pattern.sub(r'', text)

# Example Usage
print(remove_emoji("Loved the movie. It was 😘😍"))  # Output: 'Loved the movie. It was '
print(remove_emoji("Lmao 😂😁"))                    # Output: 'Lmao '

Loved the movie. It was 
Lmao 


In [8]:
import emoji
print(emoji.demojize('I love bangladesh 😂'))

I love bangladesh :face_with_tears_of_joy:
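
### Tokenization

The outline lists tokenization (splitting text into words or sentences) among the preprocessing techniques. Below is a minimal regex-based sketch using only the standard library. The function names are hypothetical, and the rules are deliberately simple: the sentence splitter will break incorrectly on abbreviations such as "Dr.".

```python
import re

def word_tokenize_simple(text):
    """Split text into word tokens and standalone punctuation marks."""
    # A word may contain one internal apostrophe (e.g. "don't");
    # any other non-space, non-word character becomes its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def sent_tokenize_simple(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(word_tokenize_simple("I don't like it, really!"))
# ['I', "don't", 'like', 'it', ',', 'really', '!']

print(sent_tokenize_simple("I am going to Delhi. It is great! Is it?"))
# ['I am going to Delhi.', 'It is great!', 'Is it?']
```

Library tokenizers such as those in NLTK or spaCy handle many edge cases that this sketch ignores.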
