Learning Resources:

https://colab.research.google.com/github/alvinntnu/python-notes/blob/master/python-basics/regex.ipynb

https://notebook.community/sastels/Onboarding/6%20-%20Regular%20Expressions

<h1>Data Cleaning and Information Extraction Through Regex</h1>

In [30]:
text = "🌟Welcome to NLP Lab!🌟 This is an advanced lab assignment that comprises text, numbers like 12345 and 99.99, emojis 😊🚀, special symbols #@&, and multiple email addresses like user1@example.com and user2@university.edu. Visit our website at https://nlp-lab.edu for resources. Our upcoming event is on 2023-09-15, featuring workshops on 'Advanced Sentiment Analysis' and 'Natural Language Generation.' Workshop fees range from $99.99 to $299.99. Contact us at info@nlp-lab.edu for inquiries. Looking forward to seeing you there! #NLP #TextProcessing"


<h4>Task 1: Remove Emojis, Special Symbols, and Non-Alphanumeric Characters</h4>
Apply regular expressions to thoroughly clean the text by removing emojis, special symbols, and all non-alphanumeric characters except spaces. Present the cleaned text after the removal.

Expected Output - Welcome to NLP Lab This is an advanced lab assignment that comprises text numbers like 12345 and 9999 emojis  special symbols  and multiple email addresses like user1examplecom and user2universityedu Visit our website at httpsnlplabedu for resources Our upcoming event is on 20230915 featuring workshops on Advanced Sentiment Analysis and Natural Language Generation Workshop fees range from 9999 to 29999 Contact us at infonlplabedu for inquiries Looking forward to seeing you there NLP TextProcessing

In [31]:
import re

pattern = r"[^\w\s]"

# Remove every character that is neither a word character nor whitespace,
# then collapse the runs of spaces left behind.
text_1 = re.sub(pattern, "", text)
text_1 = re.sub(r" {2,}", " ", text_1)
text_1

'Welcome to NLP Lab This is an advanced lab assignment that comprises text numbers like 12345 and 9999 emojis special symbols and multiple email addresses like user1examplecom and user2universityedu Visit our website at httpsnlplabedu for resources Our upcoming event is on 20230915 featuring workshops on Advanced Sentiment Analysis and Natural Language Generation Workshop fees range from 9999 to 29999 Contact us at infonlplabedu for inquiries Looking forward to seeing you there NLP TextProcessing'
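One detail of the pattern above is worth knowing: for str patterns in Python 3, \w is Unicode-aware, so accented letters survive the cleaning step. The sketch below contrasts the default behaviour with the re.ASCII flag. The sample string and the clean helper are illustrative assumptions, not part of the assignment.

```python
import re

def clean(s, ascii_only=False):
    """Drop everything except word characters and whitespace."""
    flags = re.ASCII if ascii_only else 0
    stripped = re.sub(r"[^\w\s]", "", s, flags=flags)
    # Collapse the runs of spaces left behind and trim the ends.
    return re.sub(r" {2,}", " ", stripped).strip()

# Hypothetical sample: "café #1 😊 ok", written with escapes to be unambiguous.
sample = "caf\u00e9 #1 \U0001F60A ok"

unicode_result = clean(sample)                 # keeps the accented letter
ascii_result = clean(sample, ascii_only=True)  # drops it

print(unicode_result)
print(ascii_result)
```

If the input can contain non-English text, the default (Unicode) behaviour is usually the safer choice.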

<h4>Task 2: Extract Email Addresses and URLs</h4>
Use regular expressions to extract both email addresses and URLs from the original text (the cleaned text from Task 1 no longer contains the @, . and : characters these patterns rely on). Present the extracted email addresses and URLs separately.

Expected Output -

Extracted Email Addresses: ['user1@example.com', 'user2@university.edu', 'info@nlp-lab.edu']

Extracted URLs: ['https://nlp-lab.edu']



In [32]:
# Local part, "@", then a domain made of dot-separated labels.
# The dots are escaped so that a sentence-ending period is not captured.
pattern_email = r"[\w.%+-]+@[\w-]+(?:\.[\w-]+)+"

# find all email matches
matches_email = re.findall(pattern_email, text)
print(f"Extracted Email Addresses: {matches_email}")


# Scheme, "://", then URL characters. The hyphen is escaped so that "$-_"
# is not read as a character range.
pattern_url = r"\w+://[\w$\-.+!*'(),/&?=:%]+"

# find all URL matches
matches_url = re.findall(pattern_url, text)
print(f"Extracted URLs: {matches_url}")

Extracted Email Addresses: ['user1@example.com', 'user2@university.edu', 'info@nlp-lab.edu']
Extracted URLs: ['https://nlp-lab.edu']
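Two regex details are easy to get wrong when writing email and URL patterns: an unescaped dot matches any character, and a hyphen between two characters inside square brackets defines a range. The short sketch below demonstrates both on made-up strings; the sample inputs and variable names are assumptions chosen only for illustration.

```python
import re

sample = "mail bob@site org and amy@site.org"

# Pitfall 1: "." matches any character (here a space); r"\." matches a literal dot.
loose = re.findall(r"\w+@\w+.\w+", sample)
strict = re.findall(r"\w+@\w+\.\w+", sample)

# Pitfall 2: inside [...], "$-_" is the range from "$" to "_", which covers
# digits, upper-case letters and much punctuation. Escaping the hyphen
# makes the class match only the three literal characters.
in_range = re.findall(r"[$-_]", "a<Z;b")
literal = re.findall(r"[$\-_]", "a<Z;b $-_")

print(loose)
print(strict)
print(in_range)
print(literal)
```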


<h4>Task 3: Extract Dates and Monetary Values</h4>
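Use regular expressions to extract dates (e.g., 2023-09-15) and monetary values (e.g., $99.99) from the original text (the cleaned text from Task 1 no longer contains the hyphens, dots, and dollar signs these patterns rely on). Display the extracted dates and monetary values.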

Expected Output :

Extracted Dates:['2023-09-15']

Extracted Monetary Values:['$99.99', '$299.99']

In [33]:
pattern_date = r"\d{4}-\d{2}-\d{2}"
matches_date = re.findall(pattern_date,text)
print(f"Extracted Dates: {matches_date}")

# Keep the dollar sign in the match; the cents part is optional.
pattern_money = r"\$\d+(?:\.\d{2})?"
matches_money = re.findall(pattern_money, text)
print(f"Extracted Monetary Values: {matches_money}")

Extracted Dates: ['2023-09-15']
Extracted Monetary Values: ['$99.99', '$299.99']
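A pattern such as \d{4}-\d{2}-\d{2} checks only the shape of a date, so it would also accept impossible values like 2023-13-45. One possible follow-up, sketched below, is to pass each candidate through datetime.strptime and to convert the amounts to Decimal. The sample string and the helper names parse_dates and parse_amounts are hypothetical.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical sample containing one valid and one impossible date.
sample = "Event on 2023-09-15 (not 2023-13-45). Fees range from $99.99 to $299.99."

def parse_dates(s):
    """Return date objects for every ISO-shaped candidate that is a real date."""
    dates = []
    for candidate in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", s):
        try:
            dates.append(datetime.strptime(candidate, "%Y-%m-%d").date())
        except ValueError:
            # Right shape, impossible calendar date: skip it.
            pass
    return dates

def parse_amounts(s):
    """Return dollar amounts as Decimal values (dollar sign dropped)."""
    return [Decimal(m) for m in re.findall(r"\$(\d+(?:\.\d{2})?)", s)]

dates = parse_dates(sample)
amounts = parse_amounts(sample)
print(dates)
print(amounts)
```

Decimal is used instead of float so that amounts such as 99.99 are stored exactly.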


<h4>Task 4: Clean Workshop Titles and Hashtags</h4>
Utilize regular expressions to clean the workshop titles ("Advanced Sentiment Analysis" and "Natural Language Generation") and remove hashtags (e.g., #NLP, #TextProcessing). Present the cleaned workshop titles and the text without hashtags.

Expected Output:

Cleaned Workshop Titles:['Advanced Sentiment Analysis', 'Natural Language Generation.']

Text without Hashtags:

🌟Welcome to NLP Lab!🌟 This is an advanced lab assignment that comprises text, numbers like 12345 and 99.99, emojis 😊🚀, special symbols #@&, and multiple email addresses like user1@example.com and user2@university.edu. Visit our website at https://nlp-lab.edu for resources. Our upcoming event is on 2023-09-15, featuring workshops on 'Advanced Sentiment Analysis' and 'Natural Language Generation.' Workshop fees range from $99.99 to $299.99. Contact us at info@nlp-lab.edu for inquiries. Looking forward to seeing you there!


In [34]:
# Remove each hashtag together with the whitespace before it, so that no
# trailing spaces are left behind. "#@&" is untouched because "#" must be
# followed by a word character.
pattern_hashtag = r"\s*#\w+"
text_wo_hashtag = re.sub(pattern_hashtag, "", text)

print(f"Text without Hashtags:\n{text_wo_hashtag}")

Text without Hashtags:
🌟Welcome to NLP Lab!🌟 This is an advanced lab assignment that comprises text, numbers like 12345 and 99.99, emojis 😊🚀, special symbols #@&, and multiple email addresses like user1@example.com and user2@university.edu. Visit our website at https://nlp-lab.edu for resources. Our upcoming event is on 2023-09-15, featuring workshops on 'Advanced Sentiment Analysis' and 'Natural Language Generation.' Workshop fees range from $99.99 to $299.99. Contact us at info@nlp-lab.edu for inquiries. Looking forward to seeing you there!
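The task also asks for the workshop titles. A minimal sketch follows, assuming the titles are the only single-quoted spans in the text (true for the sample string in this lab, but not in general, since apostrophes would break it). The excerpt and variable names are illustrative.

```python
import re

# Excerpt of the lab text containing the two quoted workshop titles.
excerpt = (
    "featuring workshops on 'Advanced Sentiment Analysis' and "
    "'Natural Language Generation.' Workshop fees range from $99.99 to $299.99."
)

# Capture everything between a pair of single quotes.
raw_titles = re.findall(r"'([^']+)'", excerpt)

# Optional extra cleaning: strip surrounding spaces and a trailing period.
cleaned_titles = [title.strip().rstrip(".") for title in raw_titles]

print(f"Cleaned Workshop Titles: {raw_titles}")
print(cleaned_titles)
```

The raw list keeps the period inside the second title, which is what the expected output shows; the second list is for cases where that period is unwanted.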


<h4>Task 5: Log Parsing and Information Extraction</h4>
Given the sample log entries from a web server, use regular expressions to extract IP addresses, timestamps, and HTTP response codes.

Expected Output:

ip_address = ['192.168.1.10','10.0.0.5','172.16.0.4']


In [35]:
log_entries = [
    "192.168.1.10 - - [15/Oct/2023:14:45:29 +0000] 'GET /index.html HTTP/1.1' 200 4536",
    "10.0.0.5 - - [15/Oct/2023:14:46:01 +0000] 'POST /login.php HTTP/1.1' 401 3521",
    "172.16.0.4 - - [15/Oct/2023:14:47:15 +0000] 'GET /dashboard.php HTTP/1.1' 200 5120"
]


pattern_ip = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

matches_ip = []

# find all ip matches
for i in log_entries:
    matches_ip.extend(re.findall(pattern_ip,i))
print(f"ip_address = {matches_ip}")

ip_address = ['192.168.1.10', '10.0.0.5', '172.16.0.4']
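The task statement also mentions timestamps and HTTP response codes. One way to obtain all three fields in a single pass is a pattern with named groups, sketched below. It assumes every entry follows the layout of the sample lines, with the request wrapped in single quotes as in this notebook; the names LOG_LINE and records are illustrative.

```python
import re

log_entries = [
    "192.168.1.10 - - [15/Oct/2023:14:45:29 +0000] 'GET /index.html HTTP/1.1' 200 4536",
    "10.0.0.5 - - [15/Oct/2023:14:46:01 +0000] 'POST /login.php HTTP/1.1' 401 3521",
    "172.16.0.4 - - [15/Oct/2023:14:47:15 +0000] 'GET /dashboard.php HTTP/1.1' 200 5120",
]

# One named group per field of interest.
LOG_LINE = re.compile(
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ "
    r"\[(?P<timestamp>[^\]]+)\] "
    r"'(?P<request>[^']*)' "
    r"(?P<status>\d{3}) (?P<size>\d+)"
)

records = []
for entry in log_entries:
    match = LOG_LINE.match(entry)
    if match:  # silently skip lines that do not fit the assumed layout
        records.append(match.groupdict())

timestamps = [r["timestamp"] for r in records]
status_codes = [r["status"] for r in records]

print(f"timestamp = {timestamps}")
print(f"status_code = {status_codes}")
```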


<h4>Task 6: Network Traffic Analysis</h4>
Given the sample network traffic dump, extract URLs, IP addresses, and user agents.

Expected Output:<br>
url = ['http://example.com/page1.html','http://malicious-site.com/malware.php','http://example.com/resources.js']<br>
user_agent = ['Mozilla/5.0','MaliciousBot/1.0','Mozilla/5.0']


In [29]:
network_dump = [
    "GET http://example.com/page1.html User-Agent: Mozilla/5.0",
    "POST http://malicious-site.com/malware.php User-Agent: MaliciousBot/1.0",
    "GET http://example.com/resources.js User-Agent: Mozilla/5.0"
]
# Scheme, "://", then URL characters. The hyphen is escaped so that "$-_"
# is not read as a character range.
pattern_url = r"\w+://[\w$\-.+!*'(),/&?=:%]+"
matches_url = []

for url in network_dump:
    matches_url.extend(re.findall(pattern_url,url))
print(f"url = {matches_url}")

pattern_ua = r'User-Agent: ([\w./]+)'
matches_ua = []

for ua in network_dump:
    matches_ua.extend(re.findall(pattern_ua,ua))
print(f"user_agent = {matches_ua}")

url = ['http://example.com/page1.html', 'http://malicious-site.com/malware.php', 'http://example.com/resources.js']
user_agent = ['Mozilla/5.0', 'MaliciousBot/1.0', 'Mozilla/5.0']
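The sample dump contains hostnames rather than IP addresses, so a reasonable stand-in for the "IP addresses" part of the task is the host of each URL. The sketch below parses each line once with named groups and uses urllib.parse.urlsplit for the host. It assumes every line has the form "METHOD URL User-Agent: AGENT", and the names LINE and parsed are illustrative.

```python
import re
from urllib.parse import urlsplit

network_dump = [
    "GET http://example.com/page1.html User-Agent: Mozilla/5.0",
    "POST http://malicious-site.com/malware.php User-Agent: MaliciousBot/1.0",
    "GET http://example.com/resources.js User-Agent: Mozilla/5.0",
]

# Method, URL (any run of non-space characters), then the user agent.
LINE = re.compile(r"(?P<method>[A-Z]+) (?P<url>\S+) User-Agent: (?P<user_agent>.+)")

parsed = []
for line in network_dump:
    match = LINE.match(line)
    if match:  # skip lines that do not fit the assumed layout
        parsed.append(match.groupdict())

methods = [p["method"] for p in parsed]
hosts = [urlsplit(p["url"]).hostname for p in parsed]

print(f"method = {methods}")
print(f"host = {hosts}")
```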
