# 🔹 Step 5: Handling Emojis, URLs, and Hashtags

Concept: Clean social media and user-generated text.

✅ We’ll learn:

Regex for removing or replacing URLs

Converting emojis to text (using the emoji library)

Handling hashtags and mentions


This step is part of text normalization — especially useful for social media, reviews, or chat data, where emojis 😊, links 🔗, and hashtags #️⃣ appear often.


Clean your text by removing or transforming:

URLs

Emojis

Hashtags

Mentions (@username)

This lets your NLP model focus on meaningful textual content.

# 1️⃣ Removing URLs

In [3]:
import re

text = "Check this out: https://openai.com or http://example.org or www.worldnet.com website"
clean_text = re.sub(r'http\S+|www\S+', '', text)
print(clean_text)


Check this out:  or  or  website


http\S+ → matches http plus every non-whitespace character that follows, so it covers both http:// and https:// links.

www\S+ → matches links that start with www and have no http prefix.
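Removing a link entirely leaves doubled spaces behind, as in the output above, and discards the fact that a link was present. A minimal sketch of the "replace" alternative follows. The `<URL>` placeholder, the stricter pattern, and the helper name `replace_urls` are illustrative assumptions, not part of the original notebook.

```python
import re

# Assumed pattern: require "http://" or "https://" (or a leading "www.")
# so that ordinary words such as "httpd" are left alone.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def replace_urls(text, placeholder='<URL>'):
    """Swap each URL for a placeholder token, then collapse repeated spaces."""
    replaced = URL_PATTERN.sub(placeholder, text)
    return re.sub(r'\s+', ' ', replaced).strip()

text = "Check this out: https://openai.com or http://example.org or www.worldnet.com website"
print(replace_urls(text))                  # keep a marker where each link was
print(replace_urls(text, placeholder=''))  # drop the links and tidy the spacing
```

Keeping a placeholder can be useful when the presence of a link is itself a signal, for example in spam detection.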

# 2️⃣ Remove Mentions (@user)

In [5]:
text = "Hey @srilekha, great job on NLP!"
clean_text = re.sub(r'@\w+', '', text)
print(clean_text)


Hey , great job on NLP!
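The pattern `@\w+` matches any `@` followed by word characters, so it also cuts into email addresses. One possible refinement, sketched here as an assumption and not as the notebook's method, uses a negative lookbehind so that only an `@` not preceded by a word character counts as a mention. The helper name `remove_mentions` and the sample address are hypothetical.

```python
import re

# (?<!\w) means "not directly preceded by a letter, digit, or underscore",
# which skips the "@" inside an address such as user@example.com.
MENTION_PATTERN = re.compile(r'(?<!\w)@\w+')

def remove_mentions(text):
    """Delete @mentions while leaving email addresses intact."""
    return MENTION_PATTERN.sub('', text)

sample = "Hey @srilekha, mail me at user@example.com"
print(re.sub(r'@\w+', '', sample))  # the simple pattern also damages the email
print(remove_mentions(sample))      # only the mention is removed
```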


# 3️⃣ Handle Hashtags

# Option 1 - Remove the hashtag together with its word
Limitation: this removes most of the words in the text.

In [8]:
text = "I love #NaturalLanguageProcessing #artificialintelligence, #hopeai #aiworld #genai"
clean_text = re.sub(r'#\w+', '', text)
print(clean_text)


I love  ,   


Almost nothing is left once the hashtag words are removed, which is why Option 2 is usually the better choice.
# ✅ Option 2 – Keep the hashtag word (recommended for meaning)

In [9]:
clean_text = re.sub(r'#', '', text)
print(clean_text)

I love NaturalLanguageProcessing artificialintelligence, hopeai aiworld genai
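Stripping only the `#` keeps the words but leaves multi-word tags such as `NaturalLanguageProcessing` glued together. When a tag is written in CamelCase, it can be split at each lowercase-to-uppercase boundary. This is a sketch under that assumption: all-lowercase tags such as `hopeai` carry no boundary information and are left unchanged. The helper name `expand_hashtags` is hypothetical.

```python
import re

def expand_hashtags(text):
    """Drop the '#' and split CamelCase tags into separate words."""
    def split_tag(match):
        word = match.group(1)
        # Insert a space wherever a lowercase letter is followed by an uppercase one.
        return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', word)
    return re.sub(r'#(\w+)', split_tag, text)

text = "I love #NaturalLanguageProcessing #hopeai #GenAI"
print(expand_hashtags(text))
```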


# 4️⃣ Remove Emojis (2 ways)

In [16]:
import re
emoji_pattern = re.compile(
    "["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags
    "]+", flags=re.UNICODE)

text = "I love NLP 😍🚀 I like artificial intelligence ❤️ Robots 🤖 flag 🏳️‍🌈"
clean_text = emoji_pattern.sub(r'', text)
print(clean_text)



I love NLP  I like artificial intelligence ❤️ Robots 🤖 flag ️‍
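The leftover characters in the output above can be explained by listing the code points involved. The sketch below uses only the standard library and checks each code point of three emojis from the sample against the four ranges in `emoji_pattern`. The helper name `in_pattern_ranges` is an illustrative assumption.

```python
# The four ranges covered by emoji_pattern above.
RANGES = [
    (0x1F600, 0x1F64F),  # emoticons
    (0x1F300, 0x1F5FF),  # symbols & pictographs
    (0x1F680, 0x1F6FF),  # transport & map symbols
    (0x1F1E0, 0x1F1FF),  # flags
]

def in_pattern_ranges(char):
    """Return True if the character falls inside one of the ranges."""
    return any(low <= ord(char) <= high for low, high in RANGES)

# Escapes are used so the invisible code points stay visible in the source.
samples = {
    "red heart": "\u2764\uFE0F",
    "robot": "\U0001F916",
    "rainbow flag": "\U0001F3F3\uFE0F\u200D\U0001F308",
}

for name, sequence in samples.items():
    for char in sequence:
        status = "removed" if in_pattern_ranges(char) else "kept"
        print(f"{name}: U+{ord(char):04X} {status}")
```

The heart (U+2764) and the robot (U+1F916) lie outside all four ranges. The rainbow flag is a sequence of four code points, and its variation selector (U+FE0F) and zero-width joiner (U+200D) survive after the two pictographs around them are removed.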


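# A more reliable fix: use the emoji Python library

The emoji library matches complete emojis, including multi-character sequences such as flags, so nothing is left behind.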

In [17]:
# pip install emoji
import emoji

text = "I love NLP 😍🚀 I like artificial intelligence ❤️ Robots 🤖 flag 🏳️‍🌈"
clean_text = emoji.replace_emoji(text, replace='')
print(clean_text)


I love NLP  I like artificial intelligence  Robots  flag 
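The removal steps so far can be chained into one helper. The sketch below is one possible arrangement, not the notebook's own code: the function name `clean_social_text`, its `emoji_handler` parameter, and the final whitespace cleanup are assumptions. The emoji step is passed in as a function so that the helper itself depends only on `re`.

```python
import re

def clean_social_text(text, emoji_handler=None):
    """Strip URLs and mentions, unwrap hashtags, optionally handle emojis."""
    text = re.sub(r'http\S+|www\S+', '', text)  # 1. URLs
    text = re.sub(r'@\w+', '', text)            # 2. mentions
    text = re.sub(r'#', '', text)               # 3. keep the hashtag word
    if emoji_handler is not None:               # 4. caller-supplied emoji step
        text = emoji_handler(text)
    # Collapse the gaps left by removed tokens.
    return re.sub(r'\s+', ' ', text).strip()

raw = "Hey @srilekha, see https://openai.com #genai #NLP"
print(clean_social_text(raw))
```

With the `emoji` package installed, the emoji step can be supplied as `clean_social_text(raw, emoji_handler=lambda t: emoji.replace_emoji(t, replace=''))`.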


## Convert emojis into words

If you want the model to use what an emoji means, convert it into text instead of deleting it. This is especially helpful for sentiment analysis use cases.


In [None]:
import emoji


text = "I love NLP 😍🚀 I like artificial intelligence ❤️ Robots 🤖 flag 🏳️‍🌈"
clean_text = emoji.demojize(text)
print(clean_text)
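`emoji.demojize` replaces each emoji with a shortcode wrapped in colons, such as `:red_heart:`. Plain words are often easier for a tokenizer to handle than shortcodes, so a possible follow-up step is to turn each shortcode into space-separated words. The sketch below works on an already-demojized string so that it runs without the `emoji` package. The helper name `shortcodes_to_words` and the sample string are assumptions.

```python
import re

# A shortcode here is ":" + letters, digits, underscores or hyphens + ":".
SHORTCODE_PATTERN = re.compile(r':([A-Za-z0-9_\-]+):')

def shortcodes_to_words(text):
    """Turn ':red_heart:' into ' red heart ' and tidy the spacing."""
    def to_words(match):
        name = match.group(1).replace('_', ' ').replace('-', ' ')
        return f' {name} '
    spaced = SHORTCODE_PATTERN.sub(to_words, text)
    return re.sub(r'\s+', ' ', spaced).strip()

demojized = "I love NLP :rocket: and AI :red_heart:"
print(shortcodes_to_words(demojized))
```

Text that already contains colon-delimited tokens, such as a time like 10:30:45, would also match this pattern, so the step is best applied only where that is acceptable.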