<a href="https://colab.research.google.com/github/uumair327/natural_language_processing/blob/main/NLP04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**NLP04**

**Import necessary libraries**

In [21]:
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet


**Download NLTK data**

In [22]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Sample text for analysis**

In [23]:
text = """
    Contact us at support@example.com on 12th August 2024 at 10:30 AM.
    The total cost is $1,500.50. System number is SYS1234-ABCD.
"""

**Morphological Analysis: Inflectional and Derivational**

In [24]:
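# Build the lemmatizer and stemmer once instead of on every call
lem = WordNetLemmatizer()
stem = PorterStemmer()

def morphological_analysis(word):
    """Return (lemma, stem) for one token; the lemmatizer treats it as a noun by default."""
    lemmatized_word = lem.lemmatize(word)
    stemmed_word = stem.stem(word)
    return lemmatized_word, stemmed_word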

**Tokenize text**

In [25]:
tokens = nltk.word_tokenize(text)
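
`nltk.word_tokenize` is a general-purpose tokenizer, so it splits `support@example.com` into three tokens and separates `$` from `1,500.50`, as the analysis output in this notebook shows. If such entities should survive as single tokens, one option is a small regex tokenizer that tries the entity patterns before the generic fallbacks. The following is a minimal sketch using only the standard `re` module; `TOKEN_PATTERN` and `regex_tokenize` are hypothetical names, and the alternatives cover only the formats that appear in the sample text.

```python
import re

# Alternatives are tried left to right, so the entity patterns come before
# the generic word and punctuation fallbacks.
TOKEN_PATTERN = re.compile(
    r"""
      [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}   # email address
    | \$\d{1,3}(?:,\d{3})*(?:\.\d{2})?                 # currency amount
    | SYS\d{4}-[A-Z]{4}                                # system number
    | \d{1,2}:\d{2}                                    # clock time such as 10:30
    | \w+                                              # ordinary word or number
    | [^\w\s]                                          # any single punctuation mark
    """,
    re.VERBOSE,
)

def regex_tokenize(text):
    """Split text into tokens, keeping emails, amounts, and system IDs whole."""
    return TOKEN_PATTERN.findall(text)

sample = "Contact us at support@example.com. The total cost is $1,500.50."
print(regex_tokenize(sample))
```

The trade-off is that this tokenizer knows nothing about contractions or abbreviations, which `word_tokenize` handles; it is only meant for text dominated by the entity formats listed.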

**Analyze each word**

In [26]:
for word in tokens:
    lemmatized_word, stemmed_word = morphological_analysis(word)
    print(f"Word: {word}, Lemmatized: {lemmatized_word}, Stemmed: {stemmed_word}")

Word: Contact, Lemmatized: Contact, Stemmed: contact
Word: us, Lemmatized: u, Stemmed: us
Word: at, Lemmatized: at, Stemmed: at
Word: support, Lemmatized: support, Stemmed: support
Word: @, Lemmatized: @, Stemmed: @
Word: example.com, Lemmatized: example.com, Stemmed: example.com
Word: on, Lemmatized: on, Stemmed: on
Word: 12th, Lemmatized: 12th, Stemmed: 12th
Word: August, Lemmatized: August, Stemmed: august
Word: 2024, Lemmatized: 2024, Stemmed: 2024
Word: at, Lemmatized: at, Stemmed: at
Word: 10:30, Lemmatized: 10:30, Stemmed: 10:30
Word: AM, Lemmatized: AM, Stemmed: am
Word: ., Lemmatized: ., Stemmed: .
Word: The, Lemmatized: The, Stemmed: the
Word: total, Lemmatized: total, Stemmed: total
Word: cost, Lemmatized: cost, Stemmed: cost
Word: is, Lemmatized: is, Stemmed: is
Word: $, Lemmatized: $, Stemmed: $
Word: 1,500.50, Lemmatized: 1,500.50, Stemmed: 1,500.50
Word: ., Lemmatized: ., Stemmed: .
Word: System, Lemmatized: System, Stemmed: system
Word: number, Lemmatized: number, Stemmed: number
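
The output above shows `us` lemmatized to `u`. `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so the pronoun apparently loses what looks like a plural `s`. One common remedy is to tag the tokens first and lemmatize only nouns, verbs, adjectives, and adverbs. The sketch below shows that idea; `penn_to_wordnet_pos` and `lemmatize_tagged` are hypothetical helpers, and the lemmatizer is passed in as a callable so the logic can be read and tried without the WordNet data.

```python
# WordNet POS codes as plain strings; they match the values of the
# ADJ, VERB, NOUN and ADV constants in nltk.corpus.wordnet.
_PENN_PREFIX_TO_WORDNET = (("JJ", "a"), ("VB", "v"), ("NN", "n"), ("RB", "r"))

def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None for other tags."""
    for prefix, wordnet_pos in _PENN_PREFIX_TO_WORDNET:
        if tag.startswith(prefix):
            return wordnet_pos
    return None  # pronouns, prepositions, punctuation, numbers, ...

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs, leaving unmapped words unchanged.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize.
    """
    result = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet_pos(tag)
        result.append(lemmatize(word, pos) if pos else word)
    return result
```

With NLTK loaded as in the cells above, this could be called as `lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer().lemmatize)`. Provided the tagger labels `us` as a pronoun (`PRP`), that token would pass through unchanged.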

**Regular Expressions to detect and extract**

In [27]:
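# Email address: local part, "@", domain, and a top-level domain of 2+ letters
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Day with optional ordinal suffix, month name, year, "at", clock time, AM/PM in any case
datetime_pattern = r'\b\d{1,2}(?:th|st|nd|rd)?\s+\w+\s+\d{4}\s+at\s+\d{1,2}:\d{2}\s+[AaPp][Mm]\b'
# Dollar amount with optional thousands separators and optional cents
currency_pattern = r'\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?'
# System ID: "SYS", four digits, a hyphen, four uppercase letters
system_number_pattern = r'SYS\d{4}-[A-Z]{4}'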
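
The four extraction cells that follow all repeat the same `re.findall` and print loop. As a possible alternative, the patterns could be kept in one dictionary and applied in a single pass. This is a minimal sketch; `PATTERNS` and `extract_entities` are hypothetical names, and the date-time pattern here uses `[AaPp][Mm]` for the meridiem, which is an assumption about the intended match rather than something the notebook states.

```python
import re

# Label -> regular expression, one entry per entity type
PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "datetime": r"\b\d{1,2}(?:th|st|nd|rd)?\s+\w+\s+\d{4}\s+at\s+\d{1,2}:\d{2}\s+[AaPp][Mm]\b",
    "currency": r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?",
    "system_number": r"SYS\d{4}-[A-Z]{4}",
}

def extract_entities(text, patterns=PATTERNS):
    """Return {label: [matches]} for every pattern in `patterns`."""
    return {label: re.findall(pattern, text) for label, pattern in patterns.items()}

sample = (
    "Contact us at support@example.com on 12th August 2024 at 10:30 AM. "
    "The total cost is $1,500.50. System number is SYS1234-ABCD."
)
for label, matches in extract_entities(sample).items():
    print(f"{label}: {matches}")
```

Because none of the patterns contains a capturing group, `re.findall` returns whole matches; adding a capturing group later would change what is returned.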

**Extract email IDs**

In [28]:
emails = re.findall(email_pattern, text)
print("\nEmail IDs found:")
for email in emails:
    print(email)


Email IDs found:
support@example.com


**Extract date-time values**

In [29]:
date_times = re.findall(datetime_pattern, text)
print("\nDate-Time values found:")
for date_time in date_times:
    print(date_time)


Date-Time values found:
12th August 2024 at 10:30 AM
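
The extracted value is still a plain string. If it needs to be compared or sorted, it could be converted to a `datetime` object. The sketch below assumes English full month names and the default C locale; `parse_date_time` is a hypothetical helper, not part of the notebook.

```python
import re
from datetime import datetime

def parse_date_time(value):
    """Parse a string like '12th August 2024 at 10:30 AM' into a datetime."""
    # Drop the ordinal suffix (st, nd, rd, th) that follows the day number
    cleaned = re.sub(r"^(\d{1,2})(?:st|nd|rd|th)\b", r"\1", value.strip())
    # %B = full month name, %I = 12-hour clock, %p = AM/PM
    return datetime.strptime(cleaned, "%d %B %Y at %I:%M %p")

print(parse_date_time("12th August 2024 at 10:30 AM"))  # 2024-08-12 10:30:00
```

The extraction regex accepts any word as the month (`\w+`), so a match such as `12th Aug 2024 at 10:30 AM` would raise `ValueError` here; callers may want to catch that.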


**Extract currency amounts**

In [30]:
currencies = re.findall(currency_pattern, text)
print("\nCurrency amounts found:")
for currency in currencies:
    print(currency)


Currency amounts found:
$1,500.50
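
To do arithmetic on the extracted amount, the string has to lose its `$` and thousands separators first. One way to do that, sketched here with the standard `decimal` module so that cents are not subject to binary floating-point rounding; `parse_amount` is a hypothetical helper.

```python
from decimal import Decimal

def parse_amount(value):
    """Convert a string like '$1,500.50' to a Decimal."""
    # Remove the currency symbol and thousands separators before parsing
    return Decimal(value.lstrip("$").replace(",", ""))

amount = parse_amount("$1,500.50")
print(amount)      # 1500.50
print(amount * 2)  # 3001.00
```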


**Extract system numbers**

In [31]:
system_numbers = re.findall(system_number_pattern, text)
print("\nSystem numbers found:")
for system_number in system_numbers:
    print(system_number)


System numbers found:
SYS1234-ABCD
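
If the two halves of a system number are needed separately, named groups are one option. The sketch below also adds `\b` anchors so the pattern does not match inside a longer identifier. The group names `digits` and `letters` and the helper `split_system_numbers` are hypothetical; the notebook does not say what the two parts mean.

```python
import re

# Same shape as system_number_pattern, with word boundaries and named groups
SYSTEM_NUMBER = re.compile(r"\bSYS(?P<digits>\d{4})-(?P<letters>[A-Z]{4})\b")

def split_system_numbers(text):
    """Return one {'digits': ..., 'letters': ...} dict per system number found."""
    return [match.groupdict() for match in SYSTEM_NUMBER.finditer(text)]

print(split_system_numbers("System number is SYS1234-ABCD."))
# [{'digits': '1234', 'letters': 'ABCD'}]
```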
