# Assignment 9

Q1. Write a unique paragraph (5-6 sentences) about your favorite topic (e.g., sports,
technology, food, books, etc.).
1. Convert text to lowercase and remove punctuation.
2. Tokenize the text into words and sentences.
3. Remove stopwords (using NLTK's stopwords list).
4. Display word frequency distribution (excluding stopwords).

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
import nltk
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

paragraph = """As of early May 2025, tensions between India and Pakistan have
escalated significantly following a deadly terrorist attack in Indian-administered
Kashmir on April 22, which killed 26 people, including 25 Hindu tourists and a l
ocal resident. India has attributed the attack to Pakistan-linked militants,
specifically suspects tied to Lashkar-e-Taiba, while Pakistan denies involvement.
In response, India has downgraded diplomatic ties, suspended participation in a
cross-border water treaty, and Ujjwal 102316042 ujjwall743@gmail.com both sides have
exchanged fire across the Line of Control. India’s Prime Minister Narendra Modi
has granted his military operational freedom, and Pakistan claims to have
intelligence indicating a potential Indian strike.
International mediation is currently limited, with the United States calling for
de-escalation but playing a less active role than during previous crises.
The situation remains volatile, with fears that any miscalculation could lead to
catastrophic consequences for the region.
"""

print("Original Paragraph:")
print(paragraph)

def clean_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

cleaned_para = clean_text(paragraph)
print("\nCleaned Text:")
print(cleaned_para)

# Tokenize sentences and words
sentences = sent_tokenize(paragraph)
words = word_tokenize(cleaned_para)

print(f"\nNumber of sentences: {len(sentences)}")
print("Sentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")

print(f"\nNumber of words: {len(words)}")
print(f"First 15 tokens: {words[:15]}")

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
print(f"\nNumber of words after stopword removal: {len(filtered_words)}")

# Word frequency distribution
fdist = FreqDist(filtered_words)
print("\nWord Frequencies (Top 10):")
for word, frequency in fdist.most_common(10):
    print(f"{word}: {frequency}")


Original Paragraph:
As of early May 2025, tensions between India and Pakistan have 
escalated significantly following a deadly terrorist attack in Indian-administered 
Kashmir on April 22, which killed 26 people, including 25 Hindu tourists and a l
ocal resident. India has attributed the attack to Pakistan-linked militants, 
specifically suspects tied to Lashkar-e-Taiba, while Pakistan denies involvement. 
In response, India has downgraded diplomatic ties, suspended participation in a 
cross-border water treaty, and Ujjwal 102316042 ujjwall743@gmail.com both sides have
exchanged fire across the Line of Control. India’s Prime Minister Narendra Modi 
has granted his military operational freedom, and Pakistan claims to have 
intelligence indicating a potential Indian strike. 
International mediation is currently limited, with the United States calling for 
de-escalation but playing a less active role than during previous crises. 
The situation remains volatile, with fears that any miscalc

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
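
Note on step 1: `string.punctuation` contains only ASCII characters, so the curly apostrophe in "India’s" is not removed by `clean_text`. Below is a minimal sketch of a Unicode-aware alternative using only the standard library. The helper name `strip_all_punctuation` and the sample string are illustrative assumptions, not part of the original solution.

```python
import string
import unicodedata

def strip_all_punctuation(text):
    """Drop ASCII punctuation plus any character in a Unicode 'P' (punctuation) category."""
    return "".join(
        ch for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )

# Hypothetical sample containing a curly apostrophe (U+2019)
sample = "India’s Prime Minister, Narendra Modi."

# ASCII-only approach, as in clean_text
ascii_cleaned = sample.lower().translate(str.maketrans("", "", string.punctuation))

# Unicode-aware approach
unicode_cleaned = strip_all_punctuation(sample.lower())

print(ascii_cleaned)
print(unicode_cleaned)
```

With the ASCII-only approach the curly apostrophe survives into the cleaned text. The Unicode-aware version removes it, at the cost of merging "india" and "s" into one token.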


Q2. Stemming and Lemmatization
1. Take the tokenized words from Question 1 (after stopword removal).
2. Apply stemming using NLTK's PorterStemmer and LancasterStemmer.
3. Apply lemmatization using NLTK's WordNetLemmatizer.
4. Compare and display results of both techniques.

In [15]:
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

nltk.download('wordnet')

# Initialize stemmers and lemmatizer
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

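# i. Tokens from Q1 after stopword removal
print("First few filtered words:", filtered_words[:10])

# ii. Stemming with the Porter and Lancaster stemmers
porter_stemmed = [porter_stemmer.stem(word) for word in filtered_words]
lancaster_stemmed = [lancaster_stemmer.stem(word) for word in filtered_words]

# iii. Lemmatization (lemmatize() treats each word as a noun unless pos is given)
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_words]

# iv. Side-by-side comparison of the three techniques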
print("\n~~ Comparison ~~")
print(f"{'Original':<15} {'Porter':<15} {'Lancaster':<15} {'Lemmatized':<15}")
print("-" * 60)

for i in range(min(10, len(filtered_words))):  # Limit to first 10 words for simplicity
    print(f"{filtered_words[i]:<15} {porter_stemmed[i]:<15} {lancaster_stemmed[i]:<15} {lemmatized[i]:<15}")

[nltk_data] Downloading package wordnet to /root/nltk_data...


First few filtered words: ['early', 'may', '2025', 'tensions', 'india', 'pakistan', 'escalated', 'significantly', 'following', 'deadly']

~~ Comparison ~~
Original        Porter          Lancaster       Lemmatized     
------------------------------------------------------------
early           earli           ear             early          
may             may             may             may            
2025            2025            2025            2025           
tensions        tension         tend            tension        
india           india           ind             india          
pakistan        pakistan        pak             pakistan       
escalated       escal           esc             escalated      
significantly   significantli   sign            significantly  
following       follow          follow          following      
deadly          deadli          dead            deadly         
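
To put a number on the comparison, the ten rows printed above can be tallied by how many words each technique actually changed. This is a small sketch using only the standard library. The `rows` list is copied by hand from the table, and the variable names are illustrative.

```python
# Rows copied from the comparison table: (original, porter, lancaster, lemmatized)
rows = [
    ("early", "earli", "ear", "early"),
    ("may", "may", "may", "may"),
    ("2025", "2025", "2025", "2025"),
    ("tensions", "tension", "tend", "tension"),
    ("india", "india", "ind", "india"),
    ("pakistan", "pakistan", "pak", "pakistan"),
    ("escalated", "escal", "esc", "escalated"),
    ("significantly", "significantli", "sign", "significantly"),
    ("following", "follow", "follow", "following"),
    ("deadly", "deadli", "dead", "deadly"),
]

methods = ["Porter", "Lancaster", "Lemmatized"]

# Count, per method, the rows whose output differs from the original word
changed = {
    name: sum(1 for row in rows if row[col] != row[0])
    for col, name in enumerate(methods, start=1)
}

for name, count in changed.items():
    print(f"{name:<12} changed {count} of {len(rows)} words")
```

On this sample the Lancaster stemmer is the most aggressive and the lemmatizer the most conservative. Part of that gap comes from calling `lemmatize` without a part-of-speech argument, so verb forms such as "escalated" and "following" are looked up as nouns and returned unchanged.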


Q3. Regular Expressions and Text Splitting
1. Take the original text from Question 1.
2. Use regular expressions to:
   - a. Extract all words with more than 5 letters.
   - b. Extract all numbers (if any exist in the text).
   - c. Extract all capitalized words.
3. Use text splitting techniques to:
   - a. Split the text into words containing only alphabets (removing digits and special characters).
   - b. Extract words starting with a vowel.

In [16]:
import re

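# 1. Original text from Q1
print(paragraph)

# 2a. Words with more than 5 letters (letters only, so digit-only tokens such as 102316042 are skipped)
long_words = re.findall(r'\b[A-Za-z]{6,}\b', paragraph)
print(f"Words with more than 5 letters: {long_words}")

# 2b. All numbers (runs of digits)
numbers = re.findall(r'\d+', paragraph)
print(f"All numbers: {numbers}")

# 2c. Capitalized words (an uppercase letter followed by lowercase letters)
capitalized_words = re.findall(r'\b[A-Z][a-z]*\b', paragraph)
print(f"Capitalized words: {capitalized_words}")

# 3a. Words made up of alphabetic characters only
alpha_only = re.findall(r'\b[a-zA-Z]+\b', paragraph)
print(f"Words with only alphabets: {alpha_only}")

# 3b. Words starting with a vowel
vowel_words = re.findall(r'\b[aeiouAEIOU]\w*\b', paragraph)
print(f"Words starting with a vowel: {vowel_words}")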


As of early May 2025, tensions between India and Pakistan have 
escalated significantly following a deadly terrorist attack in Indian-administered 
Kashmir on April 22, which killed 26 people, including 25 Hindu tourists and a l
ocal resident. India has attributed the attack to Pakistan-linked militants, 
specifically suspects tied to Lashkar-e-Taiba, while Pakistan denies involvement. 
In response, India has downgraded diplomatic ties, suspended participation in a 
cross-border water treaty, and Ujjwal 102316042 ujjwall743@gmail.com both sides have
exchanged fire across the Line of Control. India’s Prime Minister Narendra Modi 
has granted his military operational freedom, and Pakistan claims to have 
intelligence indicating a potential Indian strike. 
International mediation is currently limited, with the United States calling for 
de-escalation but playing a less active role than during previous crises. 
The situation remains volatile, with fears that any miscalculation could lead t
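
Part 3 asks for text splitting, while the cell above uses `re.findall` throughout. Below is a sketch of the same two steps with `re.split`. The sample sentence and variable names are illustrative assumptions, chosen to keep the result short.

```python
import re

# Hypothetical short sample in the style of the Q1 paragraph
sample = "On April 22, India has downgraded diplomatic ties and 26 people were affected."

# 3a. Split on every run of non-letter characters, then drop empty pieces
pieces = re.split(r'[^A-Za-z]+', sample)
alpha_words = [p for p in pieces if p]

# 3b. Keep the words whose first letter is a vowel
vowel_start = [w for w in alpha_words if w[0].lower() in "aeiou"]

print(alpha_words)
print(vowel_start)
```

One difference from the `findall` pattern `\b[a-zA-Z]+\b`: splitting keeps the letter part of a mixed token (for example "abc" from "abc123"), whereas the word-boundary pattern skips such tokens entirely.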

Q4. Custom Tokenization & Regex-based Text Cleaning
1. Take original text from Question 1.
2. Write a custom tokenization function that:
    a. Removes punctuation and special symbols, but keeps contractions (e.g., "isn't" should not be split into "is" and "n't").
    b. Handles hyphenated words as a single token (e.g., "state-of-the-art" remains a single token).
    c. Tokenizes numbers separately but keeps decimal numbers intact (e.g., "3.14" should remain as is).

3. Use Regex Substitutions (re.sub) to:
    a. Replace email addresses with '<EMAIL>' placeholder.
    b. Replace URLs with '<URL>' placeholder.
    c. Replace phone numbers (formats: 123-456-7890 or +91 9876543210) with '<PHONE>' placeholder.


In [17]:
import re

# 1. Use original paragraph
print("Original Paragraph:")
print(paragraph)
print("-" * 40)

# 2. Custom tokenization
def custom_tokenize(text):
    # 3. Replace email addresses, URLs, and phone numbers with placeholders
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '<EMAIL>', text)
    text = re.sub(r'https?://[^\s]+', '<URL>', text)
    text = re.sub(r'\b\d{3}-\d{3}-\d{4}\b', '<PHONE>', text)
    text = re.sub(r'\+\d{1,3}\s\d{10}\b', '<PHONE>', text)

    # 2a. Protect contractions; accept both the straight (') and curly (’) apostrophe
    text = re.sub(r"(\w+)['’](\w+)", r"\1_APOS_\2", text)

    # 2b. Protect hyphenated words with numbered markers
    hyphens = re.findall(r'\b[\w]+-[\w-]+\b', text)
    for i, word in enumerate(hyphens):
        text = text.replace(word, f"_HYPHEN_{i}_")

    # 2c. Protect decimal numbers with numbered markers
    decimals = re.findall(r'\b\d+\.\d+\b', text)
    for i, num in enumerate(decimals):
        text = text.replace(num, f"_DECIMAL_{i}_")

    # Remove remaining punctuation, keeping < and > so the placeholders survive
    text = re.sub(r'[^\w\s<>]', ' ', text)

    # Split on whitespace, then restore the protected tokens
    tokens = text.split()
    tokens = [t.replace('_APOS_', "'") for t in tokens]  # apostrophes come back as straight quotes
    for i, word in enumerate(hyphens):
        tokens = [word if t == f"_HYPHEN_{i}_" else t for t in tokens]

    for i, num in enumerate(decimals):
        tokens = [num if t == f"_DECIMAL_{i}_" else t for t in tokens]
    return tokens, text

tokens, cleaned_text = custom_tokenize(paragraph)

print("Cleaned text with placeholders:")
print(cleaned_text)
print("Tokens:")
print(tokens)


print("\n")
print("Contractions:", [t for t in tokens if "'" in t])
print("Hyphenated words:", [t for t in tokens if "-" in t])
print("Decimals:", [t for t in tokens if re.match(r'\d+\.\d+', t)])
print("Placeholders:", [t for t in tokens if t in ['<EMAIL>', '<URL>', '<PHONE>']])

Original Paragraph:
As of early May 2025, tensions between India and Pakistan have 
escalated significantly following a deadly terrorist attack in Indian-administered 
Kashmir on April 22, which killed 26 people, including 25 Hindu tourists and a l
ocal resident. India has attributed the attack to Pakistan-linked militants, 
specifically suspects tied to Lashkar-e-Taiba, while Pakistan denies involvement. 
In response, India has downgraded diplomatic ties, suspended participation in a 
cross-border water treaty, and Ujjwal 102316042 ujjwall743@gmail.com both sides have
exchanged fire across the Line of Control. India’s Prime Minister Narendra Modi 
has granted his military operational freedom, and Pakistan claims to have 
intelligence indicating a potential Indian strike. 
International mediation is currently limited, with the United States calling for 
de-escalation but playing a less active role than during previous crises. 
The situation remains volatile, with fears that any miscalc
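
The Q1 paragraph contains an email address but no URL or phone number, so the other two placeholders never appear in the run above. Below is a sketch that exercises all three `re.sub` substitutions on their own, reusing the patterns from `custom_tokenize`. The sample sentence, the address `someone@example.com`, and the helper name `mask_contacts` are made up for illustration.

```python
import re

def mask_contacts(text):
    """Apply the Q4 substitutions: emails, URLs, then both phone formats."""
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '<EMAIL>', text)
    text = re.sub(r'https?://[^\s]+', '<URL>', text)
    text = re.sub(r'\b\d{3}-\d{3}-\d{4}\b', '<PHONE>', text)   # 123-456-7890
    text = re.sub(r'\+\d{1,3}\s\d{10}\b', '<PHONE>', text)     # +91 9876543210
    return text

# Hypothetical test sentence covering every pattern
sample = ("Write to someone@example.com, visit https://example.com/page for details, "
          "or call 123-456-7890 or +91 9876543210.")

masked = mask_contacts(sample)
print(masked)
```

The URL pattern consumes everything up to the next whitespace, so a URL followed directly by a comma or period would take that punctuation into the placeholder. The sample avoids this by following the URL with a space.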